AVD Scaling

I was doing a POC around AVD in GCC High and wanted to implement autoscale. According to Microsoft’s documentation, the new autoscale is not available in GCC High. I had to either write my own, use a third-party solution, or leverage their original logic app/runbook solution. I opted to use Microsoft’s original solution, but it appears that it is no longer being updated. It uses Run As accounts, which Microsoft will retire on 9/30/2023.

I don’t understand why they don’t just update the solution to use a managed identity, but oh well. Not knowing whether the new autoscale will make it to GCC High, I updated the original solution to do a couple of things worth sharing.

The first thing we need to do is deploy the logic app/automation account solution. It deploys an automation account, upgrades some modules, and deploys a logic app. Make sure the system-assigned managed identity is enabled on the automation account and has the Contributor role on the subscription. Then open the deployed runbook and find

$AzAuth = Connect-AzAccount -ApplicationId $ConnectionAsset.ApplicationId -CertificateThumbprint $ConnectionAsset.CertificateThumbprint -TenantId $ConnectionAsset.TenantId -SubscriptionId $ConnectionAsset.SubscriptionId -EnvironmentName $EnvironmentName -ServicePrincipal

and replace it with

$AzAuth = Connect-AzAccount -Identity 

Add -EnvironmentName AzureUSGovernment if you are targeting GCC High.
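As a minimal sketch of that change, the login arguments can be built once and splatted, so the same runbook works in commercial Azure and in GCC High. The helper name Get-AvdLoginParameter is hypothetical and not part of Microsoft's runbook; -Identity and -Environment (alias -EnvironmentName) are existing Connect-AzAccount parameters.

```powershell
# Hypothetical helper (not in the original runbook): builds the arguments
# for Connect-AzAccount when logging in with the managed identity.
function Get-AvdLoginParameter {
    param (
        # Leave empty for commercial Azure; use 'AzureUSGovernment' for GCC High.
        [string]$EnvironmentName
    )

    # -Identity tells Connect-AzAccount to use the managed identity.
    $loginParams = @{ Identity = $true }

    # Only add the environment when one was supplied.
    if (-not [string]::IsNullOrWhiteSpace($EnvironmentName)) {
        $loginParams['Environment'] = $EnvironmentName
    }
    return $loginParams
}

# Usage inside the runbook (requires the Az.Accounts module):
#   $loginParams = Get-AvdLoginParameter -EnvironmentName 'AzureUSGovernment'
#   $AzAuth = Connect-AzAccount @loginParams
```

Splatting keeps the environment switch in one place instead of maintaining two copies of the login line.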

I commented out $ConnectionAsset = Get-AutomationConnection -Name $ConnectionAssetName, as we aren’t using a Run As account anymore. In your logic app, you can just make the request parameter empty: "ConnectionAssetName": ""

At this point, we’re using a managed identity to log in. Great, but let’s think about why we are using this solution. Yes, we need more VMs to satisfy user load, but it is also a cost-saving tool. If the VMs shut down at night, why pay for Premium_LRS disks while they are deallocated? We can easily add a function that converts the disk at shutdown and startup.

I added a simple function to the runbook:

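function Set-AvdDisk {
    param (
        [string]$rgName,
        $vm,
        [ValidateSet("Standard_LRS", "Premium_LRS")]
        [string]$diskSku
    )

    # $convertDisks is not a parameter of this function; it is expected to be
    # defined elsewhere in the runbook as an on/off switch for disk conversion.
    if ($convertDisks) {
        # The disk SKU of an attached disk can only be changed while the VM is deallocated.
        if ($vm.PowerState -eq 'VM deallocated') {
            Write-Output "VM $($vm.Name) is deallocated, checking disk sku"
            $vmDisk = Get-AzDisk -ResourceGroupName $rgName -DiskName $vm.StorageProfile.OsDisk.Name
            # Only update the disk when it is not already on the requested SKU.
            if ($vmDisk.Sku.Name -ne $diskSku) {
                Write-Output "Changing disk sku to $diskSku on VM $($vm.Name)"
                $vmDisk.Sku = [Microsoft.Azure.Management.Compute.Models.DiskSku]::new($diskSku)
                $vmDisk | Update-AzDisk
            }
        }
    }
}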

I call that function with 'Premium_LRS' in the foreach loop where the runbook starts a VM.

That ensures the disk is on the Premium SKU when the VM starts, but what about shutdown? That is the real cost saving. At the end of the script, when all jobs are completed, I run a few lines of PowerShell that pull back all the deallocated VMs and run the function above to convert their disks back to Standard_LRS.

    Write-Log 'Convert all deallocated VMs disk to Standard Sku'
    $stoppedVms = Get-AzVM -ResourceGroupName $ResourceGroupName -Status | Where-Object { $_.PowerState -eq 'VM deallocated' }
    foreach ($vm in $stoppedVms) {
        Set-AvdDisk -rgName $vm.ResourceGroupName -vm $vm -diskSku 'Standard_LRS'
    }

When the runbook runs and a scale-up or scale-down action hits, you can see the result in the job output. The screenshot below shows a VM being turned on. The runbook changes the disk SKU back to Premium_LRS for that specific VM before starting it.

Here is a screenshot of a VM being deallocated, where you can see the disk being set to Standard_LRS to reduce cost.

Feel free to modify the script. I’ll honestly say that it took me 5 minutes to put this together, so there is room for improvement, but it shows that the functionality works. I’ll eventually get around to making this better, with things like tag support to skip certain disks, but for now, enjoy.
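The tag support mentioned above could look something like the following. This is only a sketch: the tag name SkipDiskConversion and the function Test-AvdDiskConversionAllowed are made up for illustration and are not in the linked script. It assumes the VM object exposes a Tags dictionary, the way Get-AzVM output does.

```powershell
# Hypothetical helper: returns $false when a VM carries an opt-out tag,
# so its disk is left on whatever SKU it already has.
function Test-AvdDiskConversionAllowed {
    param (
        $vm,
        # Assumed tag name; pick whatever fits your tagging convention.
        [string]$skipTagName = 'SkipDiskConversion'
    )

    # No tags at all means nothing opts this VM out.
    if ($null -eq $vm.Tags) { return $true }

    # PowerShell's -eq is case-insensitive for strings, so 'skipdiskconversion'
    # and 'SkipDiskConversion' are treated as the same tag name.
    foreach ($key in $vm.Tags.Keys) {
        if ($key -eq $skipTagName) { return $false }
    }
    return $true
}
```

Set-AvdDisk could call this first and return early when it reports $false. Only the presence of the tag is checked here, not its value.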

Disclaimer: Test this before implementing.

https://github.com/jrudley/avd/blob/main/WVDAutoScaleRunbookARMBased.ps1

Azure Bastion Standard Sku Autoscale?

The Standard SKU of Azure Bastion fixed a lot of the pain points of the Basic SKU, with things like setting up multiple instances and setting the port to use for Linux. The one thing I did not see was autoscale. The Microsoft docs state: “Each instance can support 10 concurrent RDP connections and 50 concurrent SSH connections. The number of connections per instances depends on what actions you are taking when connected to the client VM. For example, if you are doing something data intensive, it creates a larger load for the instance to process. Once the concurrent sessions are exceeded, an additional scale unit (instance) is required.”

Imagine a scenario where we are using a hub-and-spoke topology with a Bastion sitting in our hub. We would need to set up monitoring around concurrent sessions and an alert for when the session count gets close to the limit, but why not autoscale it?

I was curious why this setting was missing, so I spun up a test environment with 2 RDP sessions. Remember that the default deployment has 2 Bastion instances. Looking at the metric for session count, we can see the following:

Now, I was totally confused about why it kept going from 1 to around 0.44 every few minutes. I understand the 1 for the average, since it’s 2 sessions across 2 instances, but I couldn’t understand why it kept dipping.

Here is the graph using sum as my aggregation. Same thing! At this point, I tried to split the graph on instance:

If I had to guess, Bastion is running on a scale set internally. That 0 on vm000000 was screwing up my metric count! Now that I had an understanding of the metrics, how could I scale this automatically? I could set up an alert rule that fires a webhook when the session count is above X or below Y. I just didn’t feel comfortable with these metrics, as the service could provision multiple scale set instances reporting 0 and I wouldn’t know.

I started doing some research and found an API call, getActiveSessions (https://docs.microsoft.com/en-us/rest/api/virtualnetwork/get-active-sessions/get-active-sessions), which would return my session count. This is ideally what I wanted, so I started going down this path. I figured I could create an Azure function or runbook that runs every so often and scales the Bastion out or in by 1 based on a switch statement.
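The request below relies on an $authHeader variable that the post does not show being built. One way it might be created is sketched here. New-ArmAuthHeader is a hypothetical helper name, and the SecureString branch is an assumption to cover Az.Accounts versions where Get-AzAccessToken returns the token as a SecureString rather than a plain string.

```powershell
# Hypothetical helper: wraps an Azure Resource Manager access token in the
# headers a REST call expects.
function New-ArmAuthHeader {
    param (
        [Parameter(Mandatory)]
        $Token   # plain string or SecureString
    )

    # Convert a SecureString token to plain text so it can go in the header.
    if ($Token -is [securestring]) {
        $Token = [System.Net.NetworkCredential]::new('', $Token).Password
    }

    return @{
        'Authorization' = "Bearer $Token"
        'Content-Type'  = 'application/json'
    }
}

# Usage after Connect-AzAccount (requires the Az.Accounts module):
#   $token = (Get-AzAccessToken -ResourceUrl 'https://management.azure.com/').Token
#   $authHeader = New-ArmAuthHeader -Token $token
```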

# $authHeader must already hold a bearer token for Azure Resource Manager.
# Kick off the getActiveSessions operation.
$restUri = "https://management.azure.com/subscriptions/$((Get-AzContext).Subscription.Id)/resourceGroups/$bastionResourceGroupName/providers/Microsoft.Network/bastionHosts/$bastionHostName/getActiveSessions?api-version=2021-03-01"
$getStatus = Invoke-WebRequest -UseBasicParsing -Uri $restUri -Headers $authHeader -Method Post
# The call is asynchronous: build the operation results URI from the request id header.
$asyncUri = "https://management.azure.com/subscriptions/$((Get-AzContext).Subscription.Id)/providers/Microsoft.Network/locations/$bastionResourceGroupLocation/operationResults/$($getStatus.Headers['x-ms-request-id'])?api-version=2020-11-01"
$sessions = Invoke-RestMethod -Uri $asyncUri -Headers $authHeader
# Poll every 2 seconds until the operation returns a result.
while ($sessions -eq 'null') {
    Start-Sleep -Seconds 2
    $sessions = Invoke-RestMethod -Uri $asyncUri -Headers $authHeader
}

Write-Output "Current session count is: $($sessions.count)"

The docs made it seem like this was a synchronous call, but it is actually asynchronous. You need to query the operation results to pull back the session count. For more information, check out this article: https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/async-operations
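The polling loop above never gives up if the operation stalls. A bounded version might look like the sketch below. Wait-AsyncResult is a hypothetical helper, and treating the literal string 'null' as "not ready yet" is carried over from the loop above rather than taken from the API documentation.

```powershell
# Hypothetical helper: polls a script block until it returns something other
# than $null or the literal string 'null', or until the timeout expires.
function Wait-AsyncResult {
    param (
        [Parameter(Mandatory)]
        [scriptblock]$Poll,
        [int]$IntervalSeconds = 2,
        [int]$TimeoutSeconds = 60
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        $result = & $Poll

        # Anything that is neither $null nor the string 'null' counts as done.
        $isPending = ($null -eq $result) -or ($result -is [string] -and $result -eq 'null')
        if (-not $isPending) {
            return $result
        }

        # Stop instead of looping forever when the operation never completes.
        if ((Get-Date) -ge $deadline) {
            throw "Operation did not complete within $TimeoutSeconds seconds."
        }
        Start-Sleep -Seconds $IntervalSeconds
    }
}

# Usage with the variables from the snippet above:
#   $sessions = Wait-AsyncResult -Poll { Invoke-RestMethod -Uri $asyncUri -Headers $authHeader }
```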

Now that I had my session count, I could use a simple switch statement to set my Bastion instance count. I started with the numbers below:

$bastionObj = Get-AzBastion -ResourceGroupName $bastionResourceGroupName -Name $bastionHostName
switch ($sessions.count)
{
    # 2 instances by default. Each can hold up to 12 sessions
    { 0..22 -contains $_ }  { Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 2 -Force }
    { 23..34 -contains $_ } { Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 3 -Force }
    { 35..45 -contains $_ } { Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 4 -Force }
    { 46..58 -contains $_ } { Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 5 -Force }
    # Any count outside the ranges above (including more than 58 sessions) falls back to 2 instances
    Default { Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit 2 -Force }
}
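Instead of hard-coding ranges, the scale unit count could be computed from the session count. The sketch below is an approximation, not a reproduction, of the ranges above: Get-BastionScaleUnit is a hypothetical helper, the 12 sessions per instance comes from the comment in the switch, and the 2 sessions of headroom is an assumption. The 2 to 50 bounds are the scale unit limits of the Standard SKU.

```powershell
# Hypothetical helper: derives a Bastion scale unit count from a session count.
function Get-BastionScaleUnit {
    param (
        [Parameter(Mandatory)]
        [int]$SessionCount,
        [int]$SessionsPerInstance = 12,  # assumption, from the comment in the switch
        [int]$HeadroomSessions = 2,      # assumption: scale out a little before instances fill up
        [int]$MinScaleUnit = 2,
        [int]$MaxScaleUnit = 50
    )

    if ($SessionCount -lt 0) {
        throw 'SessionCount cannot be negative.'
    }

    # Round up so a partially filled instance still counts as a whole one.
    $needed = [int][math]::Ceiling([double]($SessionCount + $HeadroomSessions) / $SessionsPerInstance)

    # Clamp to the allowed range for the SKU.
    return [math]::Max($MinScaleUnit, [math]::Min($MaxScaleUnit, $needed))
}

# Usage with the variables from the snippet above:
#   $units = Get-BastionScaleUnit -SessionCount $sessions.count
#   Set-AzBastion -InputObject $bastionObj -Sku "Standard" -ScaleUnit $units -Force
```

With these defaults, 22 sessions map to 2 scale units and 23 sessions map to 3, matching the first boundary in the switch. Very high counts are capped at the maximum rather than dropping back to 2.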

When I started to test the autoscale, I noticed one big problem! Setting the scale unit count disconnects all sessions. That is a horrible end-user experience. I am thinking this is why Microsoft did not implement autoscale 🙂

Well, the next best scenario is resizing at the end of the working day to keep costs low. Add the code to authenticate to Azure to a runbook or function and set it to run on a schedule. Maybe at 8 PM we resize based on the user session count, and before the workday starts we resize to an instance count that fits our requirements. I’d imagine Microsoft will implement autoscale eventually, but they need to figure out how to move existing sessions gracefully to another Bastion instance.
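Since a resize disconnects everyone, a scheduled runbook could also refuse to resize during working hours as a safety net. A minimal sketch, assuming a 7 AM to 8 PM workday; Test-OutsideWorkingHours and its hour parameters are hypothetical. The comparison uses whatever clock Get-Date reports, which may be UTC in a cloud sandbox, so convert to local time first if needed.

```powershell
# Hypothetical guard: returns $true when the given time falls outside the workday.
function Test-OutsideWorkingHours {
    param (
        [datetime]$Now = (Get-Date),
        [int]$WorkdayStartHour = 7,   # assumption: workday starts at 07:00
        [int]$WorkdayEndHour = 20     # assumption: workday ends at 20:00 (8 PM)
    )

    # Before the start hour or at/after the end hour counts as off-hours.
    return ($Now.Hour -lt $WorkdayStartHour -or $Now.Hour -ge $WorkdayEndHour)
}

# Usage in the scheduled runbook:
#   if (Test-OutsideWorkingHours) { <# resize the Bastion here #> }
```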