AVD Scaling

I was doing a POC around AVD in GCC High and wanted to implement autoscale. According to Microsoft’s documentation, the new autoscale is not available in GCC High. I had to either write my own, use a third-party solution or leverage their original logic app/runbook solution. I opted to use Microsoft’s original solution, but of course, it seems it is no longer being updated. It uses Run As accounts, which Microsoft will retire on 9/30/2023.

I don’t understand why they don’t just update the solution to use a managed identity, but oh well. Not knowing if their new autoscale will make it to GCC High, I updated the original solution to do a couple of things worth sharing.

The first thing we need to do is deploy the logic app/automation account solution. It will deploy an automation account, upgrade some modules and deploy a logic app. We will want to make sure the system-assigned managed identity is enabled on the automation account and assigned the Contributor role on the subscription. Open the deployed runbook and find

$AzAuth = Connect-AzAccount -ApplicationId $ConnectionAsset.ApplicationId -CertificateThumbprint $ConnectionAsset.CertificateThumbprint -TenantId $ConnectionAsset.TenantId -SubscriptionId $ConnectionAsset.SubscriptionId -EnvironmentName $EnvironmentName -ServicePrincipal

and replace it with

$AzAuth = Connect-AzAccount -Identity 

Add -EnvironmentName AzureUSGovernment if you are targeting GCC High.

I commented out $ConnectionAsset = Get-AutomationConnection -Name $ConnectionAssetName as we aren’t using a Run As account anymore. In your logic app, you can just make the request parameter empty: "ConnectionAssetName": ""

At this point, we’re using a managed identity to log in. Great, but let’s think about why we are using this solution. Yes, we need more VMs to satisfy user load, but it is also a cost-saving tool. If the VMs shut down at night, why pay for Premium_LRS disks while they are deallocated? We can easily add a function that converts the disk at shutdown and startup.

I added a simple function to the runbook:

function Set-AvdDisk {
    param (
        [string]$rgName,
        $vm,
        [ValidateSet("Standard_LRS", "Premium_LRS")]
        [string]$diskSku
    )

    # $convertDisks is a flag set elsewhere in the runbook
    if ($convertDisks) {
        if ($vm.PowerState -eq 'VM deallocated') {
            Write-Output "VM $($vm.Name) is deallocated, checking disk sku"
            $vmDisk = Get-AzDisk -ResourceGroupName $rgName -DiskName $vm.StorageProfile.OsDisk.Name
            if ($vmDisk.Sku.Name -ne $diskSku) {
                Write-Output "Changing disk sku to $diskSku on VM $($vm.Name)"
                $vmDisk.Sku = [Microsoft.Azure.Management.Compute.Models.DiskSku]::new($diskSku)
                $vmDisk | Update-AzDisk
            }
        }
    }
}

I just call that function in the foreach loop when the runbook starts a VM up:

That will ensure the disk is on the Premium sku when starting the VM up, but what about shutdown? That is the real cost saving. At the end of the script, when all jobs are completed, I just run a simple PowerShell snippet that pulls back all the deallocated VMs and runs the function above to convert the disks back to Standard_LRS.

    Write-Log 'Convert all deallocated VMs disk to Standard Sku'
    $stoppedVms = Get-AzVm -ResourceGroupName $ResourceGroupName -status | where {$_.PowerState -eq 'VM deallocated'}
    foreach ($vm in $stoppedVms) {
        Set-AvdDisk -rgName $vm.resourcegroupname -vm $vm -diskSku 'Standard_LRS'
    }

When the runbook starts and a scale up or down action hits, you can see the result in the output of the job. The screenshot below is turning on a VM. It will change the disk sku back to Premium_LRS for that specific vm before starting it up.

Here is a screenshot of a VM being deallocated, where you can see the disk being set to Standard_LRS to reduce cost.

Feel free to modify the script. I’ll honestly say that it took me 5 mins to put this together, so there is room for improvement, but this shows that the functionality will work. I’ll eventually get around to making this better, things like tag support to skip disks for some reason, but for now, enjoy.
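
The tag support mentioned above could look something like the sketch below. It is only an illustration and not part of the runbook: the function name Test-SkipDiskConversion and the tag name SkipDiskConversion are made up, and it assumes the VM object exposes its tags through a Tags property the way Get-AzVM output does.

```powershell
# Hypothetical helper: returns $true when a VM carries the opt-out tag.
# The function name and the default tag name are assumptions for illustration.
function Test-SkipDiskConversion {
    param(
        [Parameter(Mandatory)]$Vm,
        [string]$TagName = 'SkipDiskConversion'
    )
    # A VM without any tags is never skipped
    if ($null -eq $Vm.Tags) { return $false }
    # Compare keys case-insensitively, since Azure treats tag names that way
    foreach ($key in $Vm.Tags.Keys) {
        if ($key -ieq $TagName) { return $true }
    }
    return $false
}
```

Inside the foreach loops you would then only call Set-AvdDisk when Test-SkipDiskConversion -Vm $vm returns $false.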

Disclaimer: Test this before implementing.


Missing MSI when using the Microsoft HPC Pack in CycleCloud 8.2

I had a project that required HPC Pack, so I went right to CycleCloud to provision my new cluster. When I came to the Secrets and Certificate section, I was going to use a Key Vault to store my certificate and password. When I selected the MSI Identity dropdown, it was empty.

I started looking in the documentation and found https://learn.microsoft.com/en-us/azure/cyclecloud/hpcpack?view=cyclecloud-8#azure-user-assigned-managed-identity which stated you must use a user assigned managed identity with GET for Secret and Certificate. I had that assigned already and still no dice.

Using Key Vault is a hard requirement as I don’t want to be passing PFX files and passwords into CycleCloud. After unsuccessfully trying to figure out why my MSI was MIA, I opened a support ticket. After I explained the situation, the rep was able to reproduce it and eventually came back with a solution. CycleCloud runs a job every hour that discovers new Azure resources. We can force this update by SSH’ing into the CycleCloud node and running the following command as root:

/opt/cycle_server/cycle_server  run_action Run:Application.Timer -eq Name plugin.azure.monitor_reference

Select your Subscription in the CycleCloud UI and click the Tasks tab. You will see a task running that is collecting reference data from Azure.

Once this task is finished, navigate back to creating the HPC Pack cluster and the dropdown should populate the managed identity.

Either have patience or run the command above to force the discovery. 🙂

The Mysterious Case of Azure CycleCloud Jetpack Install Error

I’ve been doing a lot of HPC work using Azure CycleCloud. It can quickly deploy an entire HPC cluster in minutes with the benefit of autoscale on the compute node side. I will have a couple of posts about some things that gave me grey hair, but let’s first look at the Jetpack installation error that CycleCloud kept showing me.

After doing some research, I knew that CycleCloud makes use of Jetpack, and I was totally confused why Jetpack would be throwing an error about Alma Linux not being supported when it is the default image used for a Slurm deployment. I was trying to find some pattern, so I would delete the Slurm cluster and reprovision it. I SSH’ed into the node and verified from the /opt/cycle logs dir that Jetpack installed just fine. I started looking around and noticed a custom script extension (CSE) is used to start the install of something.

Drilling down into the CSE, you can see the output is referencing jetpackd.

At this point, on my new cluster, everything is green. No errors, so I waited for it to throw the Jetpack install error again. About 30 minutes in, the error came back. Totally confused, I started looking at the logs again. I browsed to the package install path on the node and noticed that the CSE had been replaced with a CSE I automatically deploy. I started thinking, is there some kind of dependency on the package the CSE initially downloads onto the VM? Well, after reviewing the logs of my own CSE, guess what, it said Fatal error: Unrecognized OS: AlmaLinux. The issues field in the CycleCloud UI is reporting the output of the CSE. The Jetpack installation error is misleading because Jetpack installed just fine. It was CrowdStrike, which does not like Alma Linux. I switched over to CentOS and the error went away. Long story short, if you deploy CSEs, know that CycleCloud will display the stderr if a problem happens and it’ll fall under a Jetpack installation error.

Migrating a CentOS Virtual Appliance to Azure

I had an interesting ask to migrate an on-premises virtual appliance to Azure. I thought this would be straightforward, but here I am writing about why it wasn’t, lol.

This particular operating system was CentOS 7. The vendor said migrating to Azure is supported, and I found that Microsoft has documentation on how to prep a VM for Azure. I followed all the steps except deprovisioning, as I was migrating a single VM. Once I shut down the VM, I used PowerShell to upload the vhd. I recommend using PowerShell since it makes sure the virtual size of the vhd is aligned to 1 MiB. Azure Storage Explorer does not do this, fyi.

connect-azaccount -Environment azureusgovernment
select-azsubscription <sub>
$resourceGroup = 'rg-<project>'
new-azresourcegroup $resourceGroup -location 'usgovvirginia'
$path = 'C:\Temp\export\va\Virtual Hard Disks\va.vhd'
$location = 'usgovvirginia'
$name = 'va-os-disk'
Add-AzVhd -LocalFilePath $path -ResourceGroupName $resourceGroup -Location $location -DiskName $name -OsType Linux -Verbose 

After running the above command, the disk will be uploaded and you can create a new VM off of it. Once I did all of that, I tried to SSH into the VM and it was rejecting my credentials. I was confused, so I restored the VM into Hyper-V and I had the same issue.

Knowing it was an SSH issue, I looked at the sshd_config file and noticed PasswordAuthentication was set to no. The appliance had this set to yes, so something changed it during VM prep.

After some trial and error, I narrowed it down to this part:

Now that I knew which step caused it, I used the VM console instead of my SSH client to change PasswordAuthentication back to yes, which fixed it. I re-uploaded the disk and was able to SSH into the VM. But wait, that’s not all.

The Linux agent was not in a ready state. I ran cat /var/log/waagent.log and saw Waiting for cloud-init to copy ovf-env.xml to /var/lib/waagent/ovf-env.xml

That file did not exist in /var/lib/waagent and I was lost as to why it wasn’t there. After reading a couple of open GitHub issues about this, I found a post https://thomasthornton.cloud/2020/04/16/cloud-init-does-not-appear-to-be-running-error-after-installing-walinuxagent/ that described the same issue I had. The fix is just to manually create the ovf-env.xml file and edit the username and hostname. I did exactly that, restarted the waagent service and everything automagically started working.

Azure Windows Security Baseline

I was designing a deployment around Azure Virtual Desktop utilizing Azure Active Directory, not AADDS or ADDS, and when checking a test deployment for compliance against the NIST 800-171 Azure Policy, it showed the Azure baseline was not being met. In a domain, I wouldn’t worry since group policy will fix this right up, but what about machines that are not domain joined? What about custom images? Yeah, I guess we could manually set everything and then image it, but I prefer a clean base and then applying configuration during my image build. Let’s take a look at how to hit this compliance checkbox.

I recalled that Microsoft released STIG templates and found the blog post Announcing Azure STIG solution templates to accelerate compliance for DoD – Azure Government (microsoft.com). I was hoping their efforts would make my life a little bit easier, but after a test deploy, I saw 33 items still not in compliance.

Looking at the workflow, it is ideally how I’d like my image process to look in my pipeline.

Deploy a baseline image, apply some scripts and then I can generate a custom image to a shared gallery for use. I didn’t want to reinvent the wheel, so I started researching whether anyone had done this already. I found a repo https://github.com/Cloudneeti/os-harderning-scripts/ that looked promising, but it was a year old and I noticed some problems with the script such as incorrect registry paths, commented out DSC snippets, etc. It covered a good bulk of the work, but needed to be cleaned up and have things added. The commented out code was around user rights assignments. Now, the DSC module for user rights assignments is old and I haven’t seen a commit in there for years. Playing around, it seems that some settings cannot be set with it. I didn’t want to hack stuff together using secedit, so I found a neat script https://blakedrumm.com/blog/set-and-check-user-rights-assignment/ that I could just pass the required rights into and move on. Everything worked except for SeDenyRemoteInteractiveLogonRight. When the right doesn’t exist in the exported config, the script couldn’t add it. So, I just wrote a snippet to add that last right.

$tempFolderPath = Join-Path $Env:Temp $(New-Guid)
New-Item -Type Directory -Path $tempFolderPath | Out-Null
secedit.exe /export /cfg $tempFolderPath\security-policy.inf

#get line number of the [Privilege Rights] section header
$line = Select-String -LiteralPath "$tempFolderPath\security-policy.inf" -Pattern "Privilege Rights" | select -ExpandProperty LineNumber

#add string
$fileContent = Get-Content "$tempFolderPath\security-policy.inf"
$fileContent[$line-1] += "`nSeDenyRemoteInteractiveLogonRight = *S-1-5-32-546"
$fileContent | out-file "$tempFolderPath\security-policy.inf" -Encoding unicode

secedit.exe /configure /db c:\windows\security\local.sdb /cfg "$tempFolderPath\security-policy.inf"
rm -force "$tempFolderPath\security-policy.inf" -confirm:$false
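
The insert step in the middle of that snippet can also be written as a small function that works on the exported lines only, which makes it easy to try without touching secedit. This is a sketch under my own assumptions: the name Add-PrivilegeRight and its parameters are hypothetical, and it expects the content of the exported .inf file as an array of lines, as returned by Get-Content.

```powershell
# Hypothetical helper: adds "<Right> = *<Sid>" directly below the
# [Privilege Rights] section header of an exported security policy.
function Add-PrivilegeRight {
    param(
        [string[]]$Content,
        [Parameter(Mandatory)][string]$Right,
        [Parameter(Mandatory)][string]$Sid
    )
    # Leave the content untouched when the right is already defined
    if ($Content -match "^\s*$([regex]::Escape($Right))\s*=") { return $Content }

    $result = foreach ($line in $Content) {
        $line
        # Emit the new entry right after the section header
        if ($line.Trim() -eq '[Privilege Rights]') { "$Right = *$Sid" }
    }
    return $result
}
```

The returned lines would then be written back with Out-File -Encoding unicode before calling secedit.exe /configure, the same way the snippet above does it.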

After running PowerShell DSC and script, the Azure baseline comes back fully compliant. I have tested this on Windows Server 2019 and Windows 10.

You can grab the files in my repo https://github.com/jrudley/azurewindowsbaseline

AKS Double Encryption

I have been living in a world of compliance these past few weeks, specifically NIST 800-171. Azure provides an initiative for NIST and one of the checks is to make sure your disks are encrypted with both a platform-managed and a customer-managed key. I recently ran into a scenario with an application that runs as a StatefulSet in Azure Kubernetes Service. Let’s talk a bit more about this and NIST.

The Azure Policy showed my disks as non-compliant because they were encrypted with only a platform-managed key. Researching the AKS docs, I found an article on using a customer-managed key, but that still is not what I need, as I need double encryption to meet compliance. After some research in the Kubernetes SIGs repo, I found the Azure Disk CSI driver doc, and check it out:

It looks like this document was modified back in May to add support for this feature, so it is fairly new. Upgrade the driver to 1.18 or above and double encryption support should be there.

To implement, create a new storage class that references your disk encryption set id with double encryption.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: byok-double-encrpytion
provisioner: disk.csi.azure.com
reclaimPolicy: Retain
allowVolumeExpansion: true
parameters:
  skuname: Premium_LRS
  kind: managed
  diskEncryptionType: EncryptionAtRestWithPlatformAndCustomerKeys
  diskEncryptionSetID: "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/diskEncryptionSets/<dek-name>"

Apply the snippet above and reference the storage class in your deployment yaml to get double encryption. This will tick that NIST compliance checkbox for AKS disks.

AAD Conditional Access What If bug

I wanted to just do a quick post about a bug I discovered in my GCC High tenant. I was recently testing out an access policy to enforce a terms of use prompt. I targeted the policy against a test group and when using the what if tool, it kept showing that none of my users in the group were getting the policy applied.

I was going absolutely nuts trying to figure out what I did wrong configuring this policy. In disbelief, I tried logging in with the user against the specific cloud app and sure enough, the TOS came up. I went back to the what if tool and it kept saying that the policy would not be applied. I thought maybe it was something to do with the TOS and switched it over to MFA in my CA policy. Same issue 😦 The only thing I thought of was that it had something to do with the group. I set the user in the group specifically on the CA policy and bingo, the what if tool worked perfectly.

I started searching Google and GitHub for this specific issue, but I could not find anything. A quick CSS ticket with some emails back and forth confirmed this is a bug and will be fixed, but there is no hard ETA other than this year. So, if you want to use the what if tool, make sure to assign the specific user and not depend on the group for your testing. I hope Google indexes this page to save you the frustration and wasted time that I went through 🙂

Missing Microsoft Applications in GCC High

An awesome feature to bring some sanity to Azure VM authentication and authorization is using Microsoft Azure Windows and Linux Virtual Machine Sign-in functionality. You can quickly test this by selecting the Login with Azure AD check box during provisioning.

I wanted to add MFA and user sign-in risk checks using conditional access before a user can actually log into the VM. When setting up my policy, I could not find the Microsoft Azure Windows Virtual Machine Sign-in or Microsoft Azure Linux Virtual Machine Sign-in app. I was puzzled, so I quickly checked my commercial tenant and sure enough they existed. I initially thought it was one of those features that is in commercial cloud but not in gov cloud. I created a support ticket and they came back noting that they have seen Microsoft applications missing in GCC High tenants. The quick fix is just to manually add the missing applications. Once they told me the application IDs are the same as in commercial, I could quickly create them.

New-AzureADServicePrincipal -AppId '372140e0-b3b7-4226-8ef9-d57986796201' #Microsoft Azure Windows Virtual Machine Sign-in
New-AzureADServicePrincipal -AppId 'ce6ff14a-7fdc-4685-bbe0-f6afdfcfa8e0' #Microsoft Azure Linux Virtual Machine Sign-In

After running those PowerShell cmdlets in my cloud shell, I can now successfully see the apps during conditional access creation.

Guest Configuration Extension Broke in Azure Gov for RHEL 8.x+

UPDATE 4/11/2022 This has been fixed!

UPDATE 4/3/2022 Still broken…waiting on product to fix.

UPDATE 3/4/2022 Microsoft product group will be pushing a fix out in 2 weeks to Azure Gov. I asked what the cause was, but nothing yet.

One of the great features of Azure Policy is the capability to audit OS settings for security baselines and compliance checking. I was deploying RHEL 8.4 and noticed the Guest Assignment was always hung in the pending state. I had no issues with Ubuntu, so it had to be something happening on the RHEL vm.

I navigated to /var/lib and saw the GuestConfig folder created, but when I was inside, it was empty. Hrm, this should be populated with folders and MOF files.

[root@rhel84 GuestConfig]# pwd
[root@rhel84 GuestConfig]# ls -al
total 4
drwxr--r--.  2 root root    6 Feb 26 22:13 .
drwxr-xr-x. 41 root root 4096 Feb 26 22:13 ..

The next step was to tail the messages log to see if anything could pinpoint what was actually happening.

[root@rhel84 GuestConfig]# tail -f /var/log/messages | grep -i GuestConfiguration
Feb 26 22:27:36 rhel84 systemd[7442]: gcd.service: Failed at step EXEC spawning /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service: Permission denied
Feb 26 22:27:46 rhel84 systemd[7458]: gcd.service: Failed at step EXEC spawning /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service: Permission denied

Alright, a permission denied. It’s something to start looking into, but I was confused why this is happening. I headed over to Azure commercial and spun up a RHEL 8.4 vm with the same Azure Policy to execute my security baseline. Well, to my surprise, everything worked just fine. Looking at /var/lib/GuestConfig showed the Configuration folder with mof files. Looking at the Guest Assignments, it was showing NonCompliant, so I know it is OK there. I did notice the Guest Extension in commercial is using 1.26.24 and gov is using 1.25.5. I tried deploying that version with no auto upgrade in gov, but same error.

After some research, I set selinux to permissive mode and instantly the Configuration folder was created and started pulling the mof files down. OK, now I am really puzzled. Working with Azure support, they were able to reproduce this same issue in Gov, but not in commercial. I was shocked no other cases had been opened. I am not sure when this problem started happening, but this means security baselines on RHEL 8.x+ are not working.

While I wait for Microsoft to investigate more why this is happening, I tried to find a workaround. Knowing it is selinux causing the issue, I thought I could just create a policy allowing the execution of the gc_linux_service.

I tested first by making sure selinux is set to Enforcing then using chcon to set the selinux context:

[root@rhel84 GuestConfig]# getenforce
chcon -t bin_t /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service

We’re all good. No errors in the messages log. Since this could be reverted by a restorecon command being run later, I added it to the selinux policy by running:

semanage fcontext -a -t bin_t /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service
restorecon -v /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service

I will update my post once Microsoft comes back with a reason why this is only happening in Azure Gov and with their proposed solution. For now, I’d not depend on the Guest Extension to perform your compliance checking for RHEL 8.x until a fix has been pushed.

Azure Bastion Standard Sku Autoscale?

The standard sku of Azure Bastion fixed a lot of the pain points of the basic sku. Things like setting up multiple instances and setting the port to use for Linux. The one thing I did not see was autoscale. The Microsoft docs state: “Each instance can support 10 concurrent RDP connections and 50 concurrent SSH connections. The number of connections per instances depends on what actions you are taking when connected to the client VM. For example, if you are doing something data intensive, it creates a larger load for the instance to process. Once the concurrent sessions are exceeded, an additional scale unit (instance) is required.” Imagine the scenario where we are using a hub and spoke topology with a bastion sitting in our hub. We would need to set up monitoring around concurrent sessions and have it alert us when the session count was getting close to the limit, but why not autoscale it?

I was curious why this setting was missing, so I spun up a test environment with 2 RDP sessions. Remember that the default deployment has 2 bastions deployed. Looking at the metric for session count, we can see the following:

Now, I was totally confused why it kept going from 1 to .44ish every few minutes. I understand the 1 for average since it’s 2 sessions across 2 instances, but couldn’t understand why it kept dipping.

Here is the graph using sum as my aggregation. Same thing! At this point, I tried to split the graph on instance:

If I had to guess, there seems to be a scale set running bastion internally. That 0 on vm000000 is screwing my metric count up! Now that I had an understanding of the metrics, how could I scale this automatically? I could set up an alert rule that fires a webhook when the session count is above X or below Y. I just didn’t feel comfortable with these metrics, as the service could provision multiple scale set instances reporting 0 and I wouldn’t know. I started doing some research and found an API call for getActiveSessions https://docs.microsoft.com/en-us/rest/api/virtualnetwork/get-active-sessions/get-active-sessions which would return my session count. This is ideally what I wanted, so I started going down this path. I figured I could create an Azure function or runbook that runs every so often and scales the bastion out by +1 or -1 based on some switch.

$restUri = "https://management.azure.com/subscriptions/$((Get-AzContext).Subscription.Id)/resourceGroups/$bastionResourceGroupName/providers/Microsoft.Network/bastionHosts/$bastionHostName/getActiveSessions?api-version=2021-03-01"
$getStatus = Invoke-webrequest -UseBasicParsing -uri $restUri -Headers $authHeader -Method Post
$asyncUri = "https://management.azure.com/subscriptions/$((Get-AzContext).Subscription.Id)/providers/Microsoft.Network/locations/$bastionResourceGroupLocation/operationResults/$($getStatus.headers['x-ms-request-id'])?api-version=2020-11-01"
$sessions = invoke-restmethod -uri $asyncUri -Headers $authHeader
while ($sessions -eq 'null' ) {
    start-sleep -s 2
    $sessions = invoke-restmethod -uri $asyncUri -Headers $authHeader
}
write-output "Current session count is: $($sessions.count)"

The docs made it seem like this was a sync call, but it is actually async. You need to query the operation results to pull back the session count. For more information, check out this article https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/async-operations
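
The while loop above waits forever if the operation never returns a result. A bounded version could look like the sketch below. The helper name Wait-ForResult and its parameters are hypothetical, and treating $null or the string 'null' as "not ready yet" is an assumption carried over from the loop above.

```powershell
# Hypothetical helper: runs a polling script block until it returns a result
# or the attempt limit is reached.
function Wait-ForResult {
    param(
        [Parameter(Mandatory)][scriptblock]$Poll,
        [int]$MaxAttempts = 30,
        [int]$DelaySeconds = 2
    )
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        $result = & $Poll
        # Assumption: $null or the string 'null' means the operation is still running
        $pending = ($null -eq $result) -or ($result -is [string] -and $result -eq 'null')
        if (-not $pending) { return $result }
        Start-Sleep -Seconds $DelaySeconds
    }
    throw "No result after $MaxAttempts attempts"
}
```

With the variables from the snippet above, the call would be along the lines of $sessions = Wait-ForResult -Poll { Invoke-RestMethod -Uri $asyncUri -Headers $authHeader }.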

Now that I have my session count, I could do a simple switch statement on setting my bastion instance count. I started with these numbers below:

$bastionObj = Get-AzBastion -ResourceGroupName $bastionResourceGroupName -Name $bastionHostName
switch ($sessions.count)
{
    #2 instances by default. Each can hold up to 12 sessions
    {0..22 -contains $_} {Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 2 -Force  }
    {23..34 -contains $_} {Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 3 -Force  }
    {35..45 -contains $_} {Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 4 -Force  }
    {46..58 -contains $_} {Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 5 -Force  }
    Default {Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 2 -Force}
}
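
If you would rather not maintain the ranges by hand, the same idea can be expressed as a calculation. This is only a sketch: the function name Get-BastionScaleUnit is made up, the 12 sessions per instance comes from the comment in the switch above, and the floor of 2 and ceiling of 50 instances are my assumptions. It will not reproduce the exact ranges above, since those leave a little headroom.

```powershell
# Hypothetical helper: converts a session count into a Bastion scale unit count.
# All defaults are assumptions and should be adjusted to your own limits.
function Get-BastionScaleUnit {
    param(
        [Parameter(Mandatory)][int]$SessionCount,
        [int]$SessionsPerUnit = 12,
        [int]$MinUnits = 2,
        [int]$MaxUnits = 50
    )
    # Round up so a partially used instance still counts as a full one
    $needed = [int][math]::Ceiling([double]$SessionCount / $SessionsPerUnit)
    # Clamp the result between the minimum and maximum instance count
    return [math]::Min($MaxUnits, [math]::Max($MinUnits, $needed))
}
```

The result could then be passed straight to Set-AzBastion -ScaleUnit instead of going through the switch.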

When I started to test the autoscale, I noticed one big problem! Setting the scale unit count disconnects all sessions. That is a horrible end user experience. I am thinking this is why Microsoft did not implement autoscale 🙂

Well, the next best scenario is resizing at the end of the working day to keep costs low. Add the code to authenticate into Azure via runbook or function and set it to run on a schedule. Maybe at 8pm we resize based on user session count, and before the work day starts we resize to an instance count that fits our requirements. I’d imagine Microsoft will implement autoscale, but they need to figure out how to move existing sessions gracefully to another bastion host.