Guest Configuration Extension Broke in Azure Gov for RHEL 8.x+

UPDATE 4/11/2022 This has been fixed!

UPDATE 4/3/2022 Still broken…waiting on the product group to fix.

UPDATE 3/4/2022 Microsoft product group will be pushing a fix out in 2 weeks to Azure Gov. I asked what the cause was, but nothing yet.

One of the great features of Azure Policy is the capability to audit OS settings for security baselines and compliance checking. I was deploying RHEL 8.4 and noticed the Guest Assignment was always hung in the pending state. I had no issues with Ubuntu, so it had to be something happening on the RHEL vm.

I navigated to /var/lib and saw the GuestConfig folder created, but when I was inside, it was empty. Hrm, this should be populated with folders and MOF files.

[root@rhel84 GuestConfig]# pwd
/var/lib/GuestConfig
[root@rhel84 GuestConfig]# ls -al
total 4
drwxr--r--.  2 root root    6 Feb 26 22:13 .
drwxr-xr-x. 41 root root 4096 Feb 26 22:13 ..

Next step was to tail the messages log to see if anything could pinpoint what was actually happening.

[root@rhel84 GuestConfig]# tail -f /var/log/messages | grep -i GuestConfiguration
Feb 26 22:27:36 rhel84 systemd[7442]: gcd.service: Failed at step EXEC spawning /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service: Permission denied
Feb 26 22:27:46 rhel84 systemd[7458]: gcd.service: Failed at step EXEC spawning /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service: Permission denied
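When the log is noisy, it can help to reduce these repeated lines to the unique binaries systemd failed to spawn. The following is a minimal sketch, assuming the log lines follow the same "Failed at step EXEC spawning" shape as the two shown above; the function name and regular expression are my own and not part of any Azure tooling.

```python
import re

# Matches systemd lines such as:
#   "gcd.service: Failed at step EXEC spawning /path/to/binary: Permission denied"
EXEC_FAILURE = re.compile(
    r"(?P<unit>\S+\.service): Failed at step EXEC spawning (?P<path>\S+): (?P<reason>.+)$"
)


def find_exec_failures(lines):
    """Return unique (unit, path, reason) tuples in first-seen order."""
    seen = []
    for line in lines:
        match = EXEC_FAILURE.search(line)
        if match:
            entry = (match["unit"], match["path"], match["reason"].strip())
            if entry not in seen:
                seen.append(entry)
    return seen
```

Feeding it the contents of /var/log/messages gives one entry per failing binary, which is the path you would then inspect with ls -Z.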

Alright, a permission denied. It’s something to start looking into, but I was confused why this is happening. I headed over to Azure commercial and spun up a RHEL 8.4 vm with the same Azure Policy to execute my security baseline. Well, to my surprise, everything worked just fine. Looking at /var/lib/GuestConfig showed the Configuration folder with mof files. Looking at the Guest Assignments, it was showing NonCompliant, so I know it is OK there. I did notice the Guest Extension in commercial is using 1.26.24 and gov is using 1.25.5. I tried deploying that version with no auto upgrade in gov, but same error.

After some research, I set selinux to permissive mode and instantly the Configuration folder was created and started pulling the mof files down. OK, now I am really puzzled. Working with Azure support, they were able to reproduce this same issue in Gov, but not in commercial. I was shocked no other cases had been opened. I am not sure when this problem started happening, but this means security baselines on RHEL 8.x+ are not working.

While I wait for Microsoft to investigate more why this is happening, I tried to find a workaround. Knowing it is selinux causing the issue, I thought I could just create a policy allowing the execution of the gc_linux_service.

I tested first by making sure selinux is set to Enforcing then using chcon to set the selinux context:

[root@rhel84 GuestConfig]# getenforce
Enforcing
[root@rhel84 GuestConfig]# chcon -t bin_t /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service

We’re all good. No errors in the messages log. Since this could be reverted by a restorecon command being run later, I added it to the selinux policy by running:

semanage fcontext -a -t bin_t /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service
restorecon -v /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service

I will update my post once Microsoft comes back with a reason why this is only happening in Azure Gov and see what proposed solution they have. For now, I’d not depend on the Guest Extension to perform your compliance checking for RHEL 8.x until a fix has been pushed.

Azure CycleCloud Slurm Scheduler CentOS Fix

Azure CycleCloud is one of those products that shines, but slowly gets the care it needs. I was deploying a Slurm Scheduler and left the defaults for the scheduler, hpc and htc operating system selection. You can see the default is CentOS 8 which has been EOL as of Dec 31st, 2021. Ubuntu is an option, but if you want to continue using CentOS 8, keep reading.

When starting the cluster up, it eventually errored out trying to install the perl-switch RPM. Looks like this package has moved.

The great thing with CycleCloud is how flexible it is. Edit the cluster and select advanced settings to set the cloud-init section. Paste the following in to use a valid repo.

    - cd /tmp
    - wget
    - yum -y install perl-Switch-2.17-10.el8.noarch.rpm

Success! The scheduler created 🙂

Now, I am sure you are asking “why has Microsoft not updated CycleCloud?” I have no idea. Competing priorities? Hopefully, the next release will fix this. Until then, use Ubuntu in the drop down or just create your own CycleCloud template for a scheduler and select that during deployment with whatever OS image you prefer.

Azure Bastion Standard Sku Autoscale?

The standard sku of Azure Bastion fixed a lot of the pain points of the basic sku, such as setting up multiple instances and setting the port to use for Linux. The one thing I did not see was autoscale. The Microsoft docs state: “Each instance can support 10 concurrent RDP connections and 50 concurrent SSH connections. The number of connections per instances depends on what actions you are taking when connected to the client VM. For example, if you are doing something data intensive, it creates a larger load for the instance to process. Once the concurrent sessions are exceeded, an additional scale unit (instance) is required.” Imagine the scenario that we are using a hub and spoke topology with a bastion sitting in our hub. We would need to set up monitoring around concurrent sessions and alert when the session count was getting close to the limit, but why not autoscale it?

I was curious why this setting was missing, so I spun up a test environment with 2 RDP sessions. Remember that the default deployment has 2 bastions deployed. Looking at the metric for session count, we can see the following:

Now, I was totally confused why it kept going from 1 to .44ish every few minutes. I understand the 1 for average since it’s 2 sessions across 2 instances, but couldn’t understand why it kept dipping.

Here is the graph using sum as my aggregation. Same thing! At this point, I tried to split the graph on instance:

Seems to be a scale set internally running bastion if I had to guess. That 0 on vm000000 is screwing my metric count up! Now that I had an understanding of the metrics, how could I scale this automatically? I could set up an alert rule that fires a webhook when the session count is above X or below Y. I just didn’t feel comfortable with these metrics as it could provision multiple scale set instances reporting 0 and I wouldn’t know. I started doing some research and found an API call for getActiveSessions which would return my session count. This is ideally what I wanted, so I started going down this path. I figured I could create an Azure function or runbook that runs every so often and scales the bastion out by +1 or -1 based on some switch.

$restUri = "$((Get-AzContext).Subscription.Id)/resourceGroups/$bastionResourceGroupName/providers/Microsoft.Network/bastionHosts/$bastionHostName/getActiveSessions?api-version=2021-03-01"
$getStatus = Invoke-webrequest -UseBasicParsing -uri $restUri -Headers $authHeader -Method Post
$asyncUri = "$((Get-AzContext).Subscription.Id)/providers/Microsoft.Network/locations/$bastionResourceGroupLocation/operationResults/$($getStatus.headers['x-ms-request-id'])?api-version=2020-11-01"
$sessions = invoke-restmethod -uri $asyncUri -Headers $authHeader
while ($sessions -eq 'null' ) {
    start-sleep -s 2
    $sessions = invoke-restmethod -uri $asyncUri -Headers $authHeader
}
write-output "Current session count is: $($sessions.count)"

The docs made it seem like this was a sync call, but it is actually async. You need to query out operation results to pull back the session count. For more information, check out this article
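The poll-until-done pattern is the same in any language. Here is a minimal sketch in Python, assuming you supply your own `fetch` callable that performs the operation results request; `poll_until_ready` and its parameters are hypothetical names for illustration, not part of any Azure SDK. Unlike the loop above, it gives up after a fixed number of attempts so a runbook cannot spin forever.

```python
import time


def poll_until_ready(fetch, is_ready, interval_s=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch() until is_ready(result) is true, sleeping between attempts.

    fetch:    callable that performs one status request and returns the result
    is_ready: callable that decides whether the result is final
    sleep:    injectable so the loop can be tested without real waiting
    """
    for attempt in range(max_attempts):
        if attempt:
            # Wait before every attempt except the first one.
            sleep(interval_s)
        result = fetch()
        if is_ready(result):
            return result
    raise TimeoutError(f"operation not finished after {max_attempts} attempts")
```

In the bastion case, `fetch` would wrap the call to the operation results URI and `is_ready` would check that a session list came back.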

Now that I have my session count, I could do a simple switch statement on setting my bastion instance count. I started with these numbers below:

$bastionObj = Get-AzBastion -ResourceGroupName $bastionResourceGroupName -Name $bastionHostName
switch ($sessions.count)
{
    #2 instances by default. Each can hold up to 12 sessions
    {0..22 -contains $_} {Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 2 -Force  }
    {23..34 -contains $_} {Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 3 -Force  }
    {35..45 -contains $_} {Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 4 -Force  }
    {46..58 -contains $_} {Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 5 -Force  }
    Default {Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 2 -Force}
}
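The thresholds are easier to tune when they live in a table instead of a chain of conditions. The sketch below mirrors the numbers from the switch, including the fallback to 2 scale units for anything above 58 sessions, which you may want to change for a busy environment. The names `SCALE_TABLE` and `scale_units_for` are hypothetical.

```python
# (highest session count covered, scale units), copied from the switch above.
SCALE_TABLE = [(22, 2), (34, 3), (45, 4), (58, 5)]

# Fallback used when no row matches, same as the Default branch of the switch.
DEFAULT_UNITS = 2


def scale_units_for(session_count):
    """Return the scale unit count for the given number of active sessions."""
    for upper_bound, units in SCALE_TABLE:
        if upper_bound >= session_count:
            return units
    return DEFAULT_UNITS
```

Adding a tier is then a one-line change to the table rather than another branch.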

When I started to test the autoscale, I noticed one big problem! When setting the scaleunit count, it disconnects all sessions. That is a horrible end user experience. I am thinking this is why Microsoft did not implement autoscale 🙂

Well, next best scenario is resizing at the end of the working day to keep costs low. Add the code to authenticate into Azure via runbook or function and set it to run on a schedule. Maybe 8pm at night we resize based on user session count and before the work day starts we would resize to an instance count that fits our requirements. I’d imagine Microsoft will implement autoscale, but they need to figure out how to move existing sessions gracefully to another bastion host.

Can’t add an Azure budget after a new subscription?

Create a budget automatically after provisioning a new subscription

I am sure you have run into the situation where you create a new subscription, but want to add an Azure budget to help monitor and control spend. As you know, it can take some time for the subscription to sync with the EA portal. Here is a snippet from the Microsoft documentation stating: “If you have a new subscription, you can’t immediately create a budget or use other Cost Management features. It might take up to 48 hours before you can use all Cost Management features.” I don’t want to wait or try to remember to add a budget the next day. Let’s use Azure tools to automatically create a budget for us.

When I first read that statement above, I was thinking how to keep track of the new subscription details and have it automatically create the budget. I thought, why not use an Azure storage queue? I can start a runbook that creates the subscription, pops a message on the queue and will try every so often to create the budget. If successful, remove the message from the queue, but if not, keep it on and retry a few hours later. Let’s take a look at a snippet of the relevant code below.

$storageAccount = Get-AzStorageAccount -ResourceGroupName $resourceGroup -Name $storageAccountName
$ctx = $storageAccount.Context
# Retrieve a specific queue
$queue = Get-AzStorageQueue -Name $queueName -Context $ctx
# Create a new message using a constructor of the CloudQueueMessage class
$queueMessage = [Microsoft.Azure.Storage.Queue.CloudQueueMessage]::new("$subName;$ownerupn")
# Add a new message to the queue
$queue.CloudQueue.AddMessageAsync($queueMessage)

The code above is self-explanatory. Get the queue information and put a message on it with the subscription name and owner. We can create another runbook that runs every few hours to process messages on the queue.

$storageAccount = get-AzStorageAccount -ResourceGroupName $resourceGroup -Name $storageAccountName 
$ctx = $storageAccount.Context
$invisibleTimeout = [System.TimeSpan]::FromSeconds(60)
$queue = Get-AzStorageQueue -Name $queueName -Context $ctx

if ($queue.QueueProperties.ApproximateMessagesCount -gt 0) {
    $queueMessage = $queue.CloudQueue.GetMessageAsync($invisibleTimeout, $null, $null)
    $msg = $queueMessage.Result.AsString
    Select-AzSubscription $msg.Split(';')[0]
    New-AzConsumptionBudget -ErrorAction SilentlyContinue -ErrorVariable cmdletError -Amount 1000 -Name "$($msg.Split(';')[0])-budget" -Category Cost -TimeGrain Monthly -StartDate (Get-Date -Format yyyy-MM).ToString() -ContactEmail '', $($msg.Split(';')[1]) -NotificationKey Key1 -NotificationThreshold 90 -NotificationEnabled 

    if ($cmdletError) {
        Write-Warning "Subscription $($msg.Split(';')[0]) might still be provisioning to ea portal. Will try again in a couple of hours..."
    }
    else {
        $queue.CloudQueue.DeleteMessageAsync($queueMessage.Result.Id, $queueMessage.Result.popReceipt)
    }
}

The runbook will check if the queue has a message, process the message, select the newly created Azure subscription and create a new budget. If it throws an exception, write a warning and keep the message on the queue to try again later. If it does create a budget, we can safely delete the message.
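The keep-on-failure, delete-on-success idea can be sketched without any Azure dependency. In this illustration an in-memory deque stands in for the storage queue and `create_budget` is a hypothetical callable you would replace with the real budget call; it is an assumption-laden sketch of the flow, not the runbook itself.

```python
from collections import deque


def parse_message(text):
    """Split a 'subscriptionName;ownerUpn' message into its two parts."""
    sub_name, owner_upn = text.split(";", 1)
    return sub_name, owner_upn


def process_queue(queue, create_budget):
    """Try every message once; failed messages stay queued for the next run.

    Returns the subscription names for which a budget was created.
    """
    remaining = deque()
    created = []
    while queue:
        text = queue.popleft()
        sub_name, owner_upn = parse_message(text)
        try:
            create_budget(sub_name, owner_upn)
        except Exception:
            # Subscription is probably still syncing; retry on the next run.
            # A real runbook would catch the specific error type instead.
            remaining.append(text)
        else:
            created.append(sub_name)
    queue.extend(remaining)
    return created
```

Running it on a schedule gives the same behaviour as the runbook: each pass drains what it can and leaves the rest for later.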

It’s simple and does the job. There are 10 ways to solve a challenge and this is just one of them. Hope it helps!

Reactivate an Azure Subscription via API – Gov Cloud Edition

I recently had to reactivate an Azure subscription that was cancelled, but I noticed the instructions do not work in Azure Gov Cloud. There is no button to reactivate, so I was forced to submit a ticket to Microsoft and they fixed me up. Typically, if a subscription was cancelled, it was done by mistake and the end user needs access ASAP. I didn’t want to wait hours by submitting a ticket to Microsoft in the future, so I started figuring out how I could do this self service style in Azure gov.

I started to research the AZ CLI and PowerShell cmdlets, but nothing was coming up. As a last resort, I looked at the API documentation and, to my surprise, I found the POST call to enable a subscription. If you noticed, I linked to API version 2019-03-01-preview. The latest version of 2020-09-01 was not working in Azure Gov. I put a code snippet below:

$azContext = Get-AzContext
$azProfile = [Microsoft.Azure.Commands.Common.Authentication.Abstractions.AzureRmProfileProvider]::Instance.Profile
$profileClient = New-Object -TypeName Microsoft.Azure.Commands.ResourceManager.Common.RMProfileClient -ArgumentList ($azProfile)
$token = $profileClient.AcquireAccessToken($azContext.Subscription.TenantId)
$authHeader = @{
    'Authorization'='Bearer ' + $token.AccessToken 
}

#commercial uri
#gov uri
$restUri = "$($subscriptionId)/providers/Microsoft.Subscription/enable?api-version=2019-10-01-preview"
Invoke-RestMethod -uri $restUri -Method POST -Headers $authHeader

In larger organizations, this code could be used in ServiceNow automation, Azure Automation, Azure Functions, etc. to get the client up and running faster. I hope this helps you with your Azure journey. 🙂
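If you automate this for both clouds, the only thing that changes is the Resource Manager host. Below is a small sketch that builds the enable URL, assuming the publicly documented ARM endpoints for commercial and Azure Government and the API version used in the snippet above; `ARM_ENDPOINTS` and `enable_subscription_url` are hypothetical names.

```python
# Assumed Resource Manager hosts; verify them for your environment.
ARM_ENDPOINTS = {
    "commercial": "https://management.azure.com",
    "gov": "https://management.usgovcloudapi.net",
}


def enable_subscription_url(cloud, subscription_id, api_version="2019-10-01-preview"):
    """Build the POST URL that re-enables a cancelled subscription."""
    base = ARM_ENDPOINTS[cloud]
    return (
        f"{base}/subscriptions/{subscription_id}"
        f"/providers/Microsoft.Subscription/enable?api-version={api_version}"
    )
```

The resulting URL is what you would POST to with the bearer token header.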

I spy, with my little eye…Encryption at Host in Azure Gov Cloud?

One of the features that has been missing from Azure gov cloud is encryption at host. The restriction of dm-crypt to certain Linux operating systems and the CPU overhead of BitLocker make this a big win, not to mention the federal compliance requirements you are trying to meet. It feels like a well-kept secret and I am not sure why. You still need to access the portal with a special link just to provision with it enabled in commercial cloud. There are no bicep/arm template examples and a lot of the documentation seems to be from 3rd party blogs. Well, look no further!

I published a quick arm template that enables encryption at host, but before we deploy, we need to make sure the feature is enabled. Check if it is enabled by running Get-AzProviderFeature -FeatureName "EncryptionAtHost" -ProviderNamespace "Microsoft.Compute" and if it is not registered, register it by running Register-AzProviderFeature -FeatureName "EncryptionAtHost" -ProviderNamespace "Microsoft.Compute"

Once the feature has been registered, you can create a VM using this link for gov cloud. When you get to the disk section, there will be an option to enable encryption at host.

Screenshot of the virtual machine creation disks pane, encryption at host highlighted.

Using an ARM template is as easy as adding a securityProfile with encryptionAtHost set to true:

          "securityProfile": {
              "encryptionAtHost": true
          }

For a complete sample, please go here

I haven’t seen any announcements for encryption at host for gov cloud, but then again, I don’t see many for gov cloud to begin with. Hopefully, this makes your FedRAMP and CMMC journey a little easier 🙂

Azure Run Command via API

I had a scenario where I needed an end user to be able to run a few ad hoc commands via an Azure Automation runbook and return the results. I am a big fan of Azure Automation as it has a nice display of the jobs and how it categorizes exceptions, warnings and output. The VM is running Ubuntu, but unfortunately, you cannot run ad hoc commands using the Invoke-AzVmRunCommand cmdlet. You need to pass in a script 😦 I tried to do an inline script and also export it out then reference it in the runbook, but it would just display nothing. Knowing that az cli can run ad hoc commands, I figured I would research the API.

I was getting nowhere with the Microsoft docs as the documented response was not the one I was getting. One simple trick I used was to open the web browser developer tools and monitor the API call being sent from the portal. In the picture below, you can see the API call and the JSON body, which has a simple command of calling date. You can copy the API call directly from the dev tools in the specific format you want.

Now that I can make the call, I noticed it is sent asynchronously. Looking at the next call in my dev tools, I saw this URI being called with some GUIDs.

I tried to research this call but I didn’t see an explanation for the GUIDs in the URI. What I did figure out is that the response from invoke-webrequest has header keys called Location and azure-asyncoperation, which both have a URI that matches the call Azure was using in the portal. We can do a simple while loop to wait until the invoke-webrequest populates content, which has our stdout from the runcommand. It will look something like this in an Azure runbook:

Connect-AzAccount -Identity

$azContext = Get-AzContext
$azProfile = [Microsoft.Azure.Commands.Common.Authentication.Abstractions.AzureRmProfileProvider]::Instance.Profile
$profileClient = New-Object -TypeName Microsoft.Azure.Commands.ResourceManager.Common.RMProfileClient -ArgumentList ($azProfile)

$token = $profileClient.AcquireAccessToken($azContext.Subscription.TenantId)

$auth = @{
    'Content-Type'  = 'application/json'
    'Authorization' = 'Bearer ' + $token.AccessToken 
}

$response = Invoke-WebRequest -useBasicParsing -Uri "$($((Get-AzContext).Subscription.Id))/resourceGroups/ubuntu/providers/Microsoft.Compute/virtualMachines/ubuntu/runCommand?api-version=2018-04-01" `
-Method "POST" `
-Headers $auth `
-ContentType "application/json" `
-Body "{`"commandId`":`"RunShellScript`",`"script`":[`"date`"]}"

Foreach ($key in ($response.Headers.GetEnumerator() | Where-Object {$_.Key -eq "Location"})) {
    $checkStatus = $Key.Value
}

$contentCheck = Invoke-WebRequest -UseBasicParsing -Uri $checkStatus -Headers $auth
while (($contentCheck.content).count -eq 0) {
    $contentCheck = Invoke-WebRequest -UseBasicParsing -Uri $checkStatus -Headers $auth
    Write-output "Waiting for async call to finish..."
    Start-Sleep -s 15
}

($contentCheck.content | convertfrom-json).value.message

As you can see, I am using a managed identity and logging in with it. The runbook calls the runcommand with a POST then it hits a while loop to wait for it to finish then output the results.
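The three moving parts of the runbook (request body, status URL from the response headers, and the message in the final payload) can be sketched as small helpers. This is an illustration only, assuming the final payload has the `value[].message` shape the runbook reads; the function names are hypothetical and no HTTP call is made here.

```python
import json


def run_command_body(script_lines, command_id="RunShellScript"):
    """Build the JSON body for the runCommand POST, one list item per line."""
    return json.dumps({"commandId": command_id, "script": list(script_lines)})


def async_status_url(headers):
    """Return the polling URL from the response headers, ignoring key case."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return lowered.get("location") or lowered.get("azure-asyncoperation")


def extract_messages(content):
    """Pull the stdout/stderr messages out of the finished operation payload."""
    payload = json.loads(content)
    return [item.get("message", "") for item in payload.get("value", [])]
```

Building the body with a JSON library avoids the backtick escaping needed in the inline PowerShell string.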

Azure Kubernetes Service and Network Security Groups

One of the most common mistakes I see is people modifying the NSG rules for AKS manually instead of letting AKS manage them. AKS is a managed service, so it will manage the rules. If the NSG rules are manually modified, AKS might reset the rules, which could leave your service in a broken state or exposed to threats.

If you look at the annotations for type LoadBalancer, you can see an annotation for service tags. Typically, we would have some kind of WAF sitting in front, such as Azure Front Door. We can set the service tag AzureFrontDoor.Backend, which will let AKS manage this inbound rule so that only Azure Front Door’s IPs can communicate with this public IP.

We can do a quick example of deploying this YAML which has the service type set to LoadBalancer which will provision us a public ip.

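apiVersion: apps/v1
kind: Deployment
metadata:
  name: aks-helloworld-one
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aks-helloworld-one
  template:
    metadata:
      labels:
        app: aks-helloworld-one
    spec:
      containers:
      - name: aks-helloworld-one
        image: mcr.microsoft.com/azuredocs/aks-helloworld:v1  # assumed: Microsoft's AKS hello-world sample image
        ports:
        - containerPort: 80
        env:
        - name: TITLE
          value: "Welcome to Azure Kubernetes Service (AKS)"
---
apiVersion: v1
kind: Service
metadata:
  name: aks-helloworld-one
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: aks-helloworld-one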

Let’s do a kubectl apply and view the svc.

You can see a public ip has been associated with the svc. Let’s take a look at the inbound NSG. The public ip is open to the internet. I want this svc to be protected by my WAF on Azure Front Door.

In order to apply a tag correctly, let’s modify the yaml to set the correct annotation. In the picture below, I am setting the tag AzureFrontDoor.Backend, which AKS will ensure is always present and managed automatically.

Save the YAML and apply it to update the service.

Viewing the inbound NSG for AKS, we can see it automatically updated the service tag.
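For reference, here is a sketch that generates the annotated Service as JSON, which kubectl apply -f also accepts. The annotation key `service.beta.kubernetes.io/azure-allowed-service-tags` is my assumption of the annotation being discussed, based on the Azure cloud provider documentation, so verify it against the current docs; `load_balancer_service` is a hypothetical helper name.

```python
import json

# Assumed annotation key for restricting inbound traffic to Azure service tags.
ALLOWED_TAGS_ANNOTATION = "service.beta.kubernetes.io/azure-allowed-service-tags"


def load_balancer_service(name, port, allowed_service_tags):
    """Return a Service manifest of type LoadBalancer limited to the given tags."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "annotations": {
                # Multiple tags are passed as a comma-separated list.
                ALLOWED_TAGS_ANNOTATION: ",".join(allowed_service_tags),
            },
        },
        "spec": {
            "type": "LoadBalancer",
            "ports": [{"port": port}],
            "selector": {"app": name},
        },
    }


if __name__ == "__main__":
    manifest = load_balancer_service("aks-helloworld-one", 80, ["AzureFrontDoor.Backend"])
    print(json.dumps(manifest, indent=2))
```

Redirect the output to a file and apply it the same way as the YAML.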

Remember, AKS is a managed service. Let it manage the NSGs for you!

Azure Bastion Alternatives

I had a project come up where I needed 2 factor auth and no public IP with RDP access. I instantly thought Azure Bastion would be great for this. I can use conditional access and hit my private IP VMs. Well, the VM had to be Ubuntu running Gnome desktop with xRDP. Azure Bastion is tied to the OS profile where it is SSH for Linux or RDP for Windows. There is an open feedback item to allow RDP to Linux. With all of that being said, let me present… Apache Guacamole. Nothing like presenting to executives saying let’s use Guacamole to solve our issue, haha.

I found an Azure marketplace image from Bitnami that provisions a VM with http to https redirection enabled with some dummy certificates and guacamole installed.

Once you provision the image, it has a public ip already assigned with an nsg on the nic opening ports 80, 443 and 22. I’d modify that nsg to remove port 80 and lock down port 22 to your IP, or remove it and just use the serial console. Now, going back to my original requirement of 2fa, there is a saml extension you can use. We can easily create a new saml application in Azure Active Directory as well. Before we do this, we want to make sure we add a new user account with admin permissions in the format of user@aadDomain, else when we browse to the UI with our saml configured, we won’t be able to log in unless we use the API with the default guacadmin account. You can certainly use the API to create new saml accounts in Guacamole, but log in first using the guacadmin creds to make it easier for testing. In order to get the default guacadmin password, look here. Make sure you change it!

Log in and add a new user with admin permissions. For username, put in the fqdn of the user in AAD. Do not set a password.

Once we log in with the AAD creds, we can delete the guacadmin account.

Get on the Guacamole VM and download the saml extension, tar -xf and copy the jar inside /opt/bitnami/guacamole/extensions. When guacamole is restarted, it will automatically load the jar. We don’t want to restart just yet, as we need to configure the file with the saml entries. Let’s create a new Azure Enterprise Application and select Create your own application.

Give your app a new name and hit Create.

You will be taken to your new application which you will now select Single Sign On

Select SAML

Edit the basic configuration.

First, modify the Entity ID and Reply URL. We want to put in the FQDN where end users will access it via their browser. I have a domain I mapped to the public IP of the VM. Hit save, then grab the Login URL from #4.

Back on the VM, edit /opt/bitnami/guacamole/ file and add these 3 lines:

saml-idp-url: login url from our enterprise saml app

saml-entity-id and saml-callback-url are our fqdn mapped to the public ip

Save this file. The last step is we need a valid certificate for our domain. I already have one and replaced the server.crt and server.key in /opt/bitnami/apache/conf/bitnami/certs. There is also a tool from Bitnami that does Let’s Encrypt for you.

Restart the required services with sudo /opt/bitnami/ restart

Now, either add the AAD user to the enterprise application or toggle user assignment required to No

Have your user navigate to the FQDN and they will be redirected to auth against AAD.

A couple of things to note. I took this project one step further, using an ARM template and storing the secrets in a key vault with the certificate. If you have a WAF in front such as Azure Front Door, assign a custom domain name with tls and set up your AAD application to use that FQDN. I have a custom script extension that preps the VM with the steps we did above. For my project, I just pushed the ARM template to Template Specs for quick and easy provisioning.

Azure Devtest Labs “Install Windows Update” not working

I was building out a formula in Devtest labs the other day and added a few artifacts, including “Install Windows Updates”. My goal was to build an ARM template that deploys DTL, creates a VM, then a custom image off that VM. Everything worked great, but as a good IT pro, I double checked my work. Looking at the VM, you can drill down into the artifacts section and see the status of each artifact applied. I had green everywhere, so it looked good. Upon further inspection, I noticed the Windows Update task finished extremely quick.

Clicking on the task to get more details, I saw this:

It just displayed the updates and rebooted. Well, maybe it did install? I checked the update history on the VM and it was empty. It also displayed those updates to be installed. Curious, I went to the PowerShell file in the packages folder C:\Packages\Plugins\Microsoft.Compute.CustomScriptExtension\1.10.12\Downloads\4\PublicRepo\master\0b7a713c381a8cbecf04f92122d1c0b07324871e\Artifacts\windows-install-windows-updates\scripts\artifact.ps1 and saw the cmdlet:

I was able to run this script and reproduce the same output as above. I did some digging and it looks like this cmdlet needs additional parameters now unlike in the past.

I updated it and re-ran the script which displayed:

This is what I would expect to see. Now, the bigger problem is that this is in Microsoft’s artifacts repo. If you are using the public repo for your artifacts and have this specific artifact being consumed, I’d double check to make sure you are indeed patching your operating system. I did submit a pull request with the fix, so hopefully they review it soon.

Edit: Microsoft approved my pull request to fix this. Shouldn’t be an issue now 🙂