Azure CycleCloud, Slurm and Star-CCM+ v17

I am a huge fan of Azure CycleCloud. It makes administration tasks so much easier. There are some cons to running it in Gov Cloud, such as core-hour tracking not working, but that isn't a big deal if you deploy to a subscription and just calculate the costs from that. Let's talk about using Slurm and Star-CCM+. Microsoft published a whitepaper back in 2020 using a PBS cluster and version 14 of Star-CCM+. I prefer Slurm, and version 17 of Star-CCM+ is out, so let's look at a more updated tutorial.

Microsoft updated CycleCloud to version 8.3 last month; you can find the release notes here. Search the Azure Marketplace for CycleCloud and provision an 8.3 cluster. One thing to note, and this sort of makes me question some of the default provisioning: by default, it will give your cluster a public IP. One of the mysteries of CycleCloud is using it on an internal network, not exposed to the internet. Most orgs have an S2S VPN or ExpressRoute into their VNet. For my demo, I am deploying into a spoke that has a peered hub connected back to on-premises. When you provision the cluster, do not select a public IP and it will assign a private IP. More about this later when we deploy Slurm.

Once your cluster is up, navigate to https://<privateIp> and add your name, SSH key, and subscription details. These steps are documented here. Create a new cluster and select Slurm within the CycleCloud UI. I called my cluster slurm-test.


For Required Settings, I gave my scheduler and login node a VM type of Standard_B4ms. For my HPC and HTC VM SKU, I am using HC44rs. Since this is 44 cores per VM, I wanted a total of 176 cores, which gives me four VMs that I can submit jobs to.

In Required Settings, make note of your Default NFS Share. I provisioned 256 GB since I have a rough idea of how big my resulting sim files will be. The thing to note is that I had one heck of a time trying to expand this disk once the scheduler was provisioned. When you terminate the scheduler, it does not delete this disk, and when you start the scheduler back up, it re-attaches it. I could resize it in Azure, but AlmaLinux was not seeing the new size. Maybe I was just doing something wrong, but I have resized numerous disks on RHEL and CentOS in Azure with no issue.
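
If you hit the same wall, the usual pattern for growing an XFS filesystem that sits on LVM is below. This is a rough sketch; the /dev/sdb device name is an assumption, so verify what actually backs the shared volume with lsblk first (growpart comes from the cloud-utils-growpart package).

# confirm which device/partition backs the shared volume first (assumed /dev/sdb1 here)
lsblk
sudo growpart /dev/sdb 1
sudo pvresize /dev/sdb1
sudo lvextend -l +100%FREE /dev/mapper/vg_cyclecloud_builtinshared-lv0
sudo xfs_growfs /shared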

The Advanced Settings screen is where I want to talk about private networking. Uncheck Name as Hostname. This will write all the hostnames into the local hosts file, so the compute nodes and scheduler can talk to each other. My VNet uses custom DNS rather than Azure DNS, so these nodes need a way to resolve each other, hence unchecking that checkbox. Also, uncheck all of the Advanced Networking checkboxes. This cluster will only be accessible from on-premises, not the internet.

In the cloud-init configuration, let's add some things for the scheduler and the HPC array; the first block below goes on the scheduler and the second on the HPC nodes. I am not using HTC, login nodes, or an HA scheduler, so I am skipping cloud-init there.

#cloud-config
runcmd:
   - yum install -y libSM libX11 libXext libXt libnsl.x86_64 git mpich
   - wget https://aka.ms/downloadazcopy-v10-linux
   - tar -xvf downloadazcopy-v10-linux
   - sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
   - echo "/dev/mapper/vg_cyclecloud_builtinsched-lv0   /sched   xfs defaults 0 0" >> /etc/fstab
   - echo "/dev/mapper/vg_cyclecloud_builtinshared-lv0  /shared  xfs defaults 0 0" >> /etc/fstab 
#cloud-config
runcmd:
   - yum install -y libSM libX11 libXext libXt libnsl.x86_64 git mpich

The first cloud-init block installs some required packages I need for Star-CCM+ and for pulling some git files down. The things to note are the two lines that write into /etc/fstab. This is a bug, and I have no idea why it wasn't fixed in the 8.3 release. If you don't add these lines, you won't be able to SSH to the scheduler after rebooting it. Microsoft confirmed this for me via a support case.

At this point, we are ready to start our Slurm cluster. Hit the Start button and give it about 5-10 mins to provision a VM and install the software.

Once the scheduler is running, highlight it and select Connect to get an SSH command prepopulated with your username and IP. Once connected, copy over your Star-CCM+ installer. I am using a managed identity on my scheduler to connect to a storage account; in my cloud-init script, I install azcopy to make life easier getting files on and off the scheduler.
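
A rough sketch of that copy, assuming the scheduler's system-assigned identity has the Storage Blob Data Reader role on the storage account; the account name, container, destination, and installer file name below are placeholders, not the real ones:

azcopy login --identity
azcopy copy "https://mystorage.blob.core.usgovcloudapi.net/installers/STAR-CCM+17.04.008_01_linux-x86_64-2.17_gnu9.2-r8.tar.gz" /tmp/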

Before installing Star-CCM+, I navigated to /shared and created an apps folder to hold my Star-CCM+ install and a data folder to hold my job files. Assign permissions on those two folders to the account you plan to execute jobs under (if you plan to run under root, which you shouldn't, you'll need to add two environment variables; Slurm will complain and tell you which ones after an sbatch). With that, we can untar the file and run a single command to silently install.
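
Something like this, where hpcuser is a placeholder for whatever account will be submitting jobs:

sudo mkdir -p /shared/apps /shared/data
sudo chown -R hpcuser:hpcuser /shared/apps /shared/data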

 sudo ./STAR-CCM+17.04.008_01_linux-x86_64-2.17_gnu9.2-r8.sh -i silent -DPRODUCTEXCELLENCEPROGRAM=0 -DINSTALLDIR=/shared/apps -DINSTALLFLEX=false -DADDSYSTEMPATH=true -DNODOC=true

I already have a license server, so I am not installing one locally, hence -DINSTALLFLEX=false.

Alright, at this point, Star-CCM+ should be installed. Copy your .sim file to the scheduler. We need to create an sbatch script; I'll provide a template here:

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=44
#SBATCH --output=output.%j.starccm=demo
#SBATCH --time=03:00:00
#SBATCH --job-name=demo_test
module purge
module load mpi

INSTALL_DIR="/shared/apps"
DATA_DIR="/shared/data"
CASE="test.sim"
STARCCM_VERSION="17.04.008-R8"

export PATH=$INSTALL_DIR/$STARCCM_VERSION/STAR-CCM+$STARCCM_VERSION/star/bin:$PATH

starccm+ -batch -power -mpi openmpi -bs slurm -licpath 1999@myLicenseServer.full.dns.name $DATA_DIR/$CASE

To sum it up, I am telling Slurm to run my submission on one node and use the 44 cores available on that node, give it a max runtime of 3 hours, and call the job demo_test. The magic is executing starccm+ and specifying our batch system as slurm; that takes our #SBATCH settings above and uses them.

Now we can run sbatch slurmtest.sh. We can check our job submission with squeue and see which node is being spun up. We can also check in the CycleCloud UI.
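
For reference, the submission and a quick status check look like this (slurmtest.sh being whatever you saved the template above as):

sbatch slurmtest.sh
squeue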

Once the actual VMSS instance is started, it will create a log file on the scheduler at the path we set in our sbatch output. If we cat that file, we can verify all 44 cores are being used.

Once the job is finished, the result is saved in the same location as the case file, and that location is shown in the output file as well.

What is really cool is that once the job finishes, CycleCloud spins down the HC44rs VMSS instance it created, so we aren't paying for these large HPC VMs when not using them. Star-CCM+ pairs really well with Slurm and CycleCloud. I highly recommend it!

AVD Scaling

I was doing a POC around AVD in GCC High and wanted to implement autoscale. Reading Microsoft's documentation, the new autoscale feature is not in GCC High. I had to either write my own, use a 3rd-party solution, or leverage their original logic app/runbook solution. I opted for Microsoft's original solution, but of course, it seems it is no longer being updated: it uses Run As accounts, which Microsoft will retire on 9/30/2023.

I don't understand why they just don't update the solution to use a managed identity, but oh well. Not knowing if the new autoscale will make it to GCC High, I updated it to do a couple of things worth sharing.

The first thing we need to do is deploy the logic app/automation account solution. It deploys an automation account, upgrades some modules, and deploys a logic app. Make sure the system-assigned managed identity is enabled on the automation account and assigned the Contributor role on the subscription. Open the deployed runbook and find

$AzAuth = Connect-AzAccount -ApplicationId $ConnectionAsset.ApplicationId -CertificateThumbprint $ConnectionAsset.CertificateThumbprint -TenantId $ConnectionAsset.TenantId -SubscriptionId $ConnectionAsset.SubscriptionId -EnvironmentName $EnvironmentName -ServicePrincipal

and replace it with

$AzAuth = Connect-AzAccount -Identity 

Add -EnvironmentName AzureUSGovernment if you are hitting GCC High.

I commented out $ConnectionAsset = Get-AutomationConnection -Name $ConnectionAssetName since we aren't using a Run As account anymore. In your logic app, you can just leave the request parameter empty: "ConnectionAssetName": ""

At this point, we're using a managed identity to log in. Great, but let's think about why we are using this solution. Yes, we need more VMs to satisfy user load, but it is also a cost-savings tool. If the VMs will be shut down at night, why pay for Premium_LRS disks? We can easily add a function that converts the disk SKU at shutdown and startup.

I added a simple function to the runbook:

function Set-AvdDisk {
    param (
        [string]$rgName,
        $vm,
        [ValidateSet("Standard_LRS", "Premium_LRS")]
        [string]$diskSku
    )

    # $convertDisks is a script-level flag so the conversion logic can be toggled off
    if ($convertDisks) {
        # only touch disks on VMs that are fully deallocated
        if ($vm.PowerState -eq 'VM deallocated') {
            Write-Output "VM $($vm.Name) is deallocated, checking disk sku"
            $vmDisk = Get-AzDisk -ResourceGroupName $rgName -DiskName $vm.StorageProfile.OSDisk.Name
            if ($vmDisk.Sku.Name -ne $diskSku) {
                Write-Output "Changing disk sku to $diskSku on VM $($vm.Name)"
                $vmDisk.Sku = [Microsoft.Azure.Management.Compute.Models.DiskSku]::new($diskSku)
                $vmDisk | Update-AzDisk
            }
        }
    }
}

I just call that function in the foreach loop where the runbook starts a VM up; something along these lines (the loop variable name will depend on your copy of the runbook):
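
# hypothetical placement inside the existing start-up foreach loop, before the VM is started;
# the loop variable name is an assumption, not the runbook's actual name
Set-AvdDisk -rgName $VM.ResourceGroupName -vm $VM -diskSku 'Premium_LRS'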

That ensures the disk is the Premium SKU when starting the VM up, but what about shutdown? That is the real cost savings. At the end of the script, when all jobs are completed, I run a simple snippet that pulls back all the deallocated VMs and runs the function above to convert the disks back to Standard_LRS.

    Write-Log 'Convert all deallocated VMs disk to Standard Sku'
    $stoppedVms = Get-AzVm -ResourceGroupName $ResourceGroupName -status | where {$_.PowerState -eq 'VM deallocated'}
    foreach ($vm in $stoppedVms) {
        Set-AvdDisk -rgName $vm.resourcegroupname -vm $vm -diskSku 'Standard_LRS'
    }

When the runbook runs and a scale-up or scale-down action hits, you can see the result in the output of the job. The screenshot below shows a VM being turned on; the disk SKU is changed back to Premium_LRS for that specific VM before it starts up.

Here is a screenshot of a VM being deallocated, where you can see the disk being set to Standard_LRS to reduce cost.

Feel free to modify the script. I'll honestly say it took me five minutes to put this together, so there is room for improvement, but it shows that the approach works. I'll eventually get around to making it better, with things like tag support to skip certain disks, but for now, enjoy.

Disclaimer: Test this before implementing.

https://github.com/jrudley/avd/blob/main/WVDAutoScaleRunbookARMBased.ps1

Missing MSI when using the Microsoft HPC Pack in CycleCloud 8.2

I had a project that required HPC Pack, so I went right to CycleCloud to provision my new cluster. When I came to the Secrets and Certificate section, I wanted to use a Key Vault to store my certificate and password, but after selecting the MSI Identity dropdown, it was empty.

I started looking at the documentation and found https://learn.microsoft.com/en-us/azure/cyclecloud/hpcpack?view=cyclecloud-8#azure-user-assigned-managed-identity which states you must use a user-assigned managed identity with GET permission for Secrets and Certificates. I had that assigned already and still no dice.

Using Key Vault is a hard requirement, as I don't want to be passing PFX files and passwords into CycleCloud. After unsuccessfully trying to figure out why my MSI was MIA, I opened a support ticket. After explaining the situation, the rep was able to reproduce it and eventually came back with a solution. CycleCloud runs a job every hour that discovers new Azure resources. We can force this update by SSHing into the CycleCloud node and, as root, running the following command:

/opt/cycle_server/cycle_server  run_action Run:Application.Timer -eq Name plugin.azure.monitor_reference

Select your subscription in the CycleCloud UI and click the Tasks tab. You will see a running task that is collecting reference data from Azure.

Once this task finishes, navigate back to creating the HPC Pack cluster and the dropdown should populate with the managed identity.

Either have patience or run the command above to force the discovery. 🙂

The Mysterious Case of Azure CycleCloud Jetpack Install Error

I've been doing a lot of HPC work using Azure CycleCloud. It can deploy an entire HPC cluster in minutes, with the benefit of autoscale on the compute-node side. I will have a couple of posts about things that gave me grey hair, but let's first look at the Jetpack installation error CycleCloud kept showing me.

After doing some research, I knew that CycleCloud makes use of Jetpack, and I was totally confused why Jetpack would throw an error about AlmaLinux not being supported when it is the default image used for a Slurm deployment. I was trying to find a pattern, so I would delete the Slurm cluster and reprovision it. I SSHed into the node and verified from the logs under /opt/cycle that Jetpack installed just fine. Looking around, I noticed a custom script extension is used to kick off an install.

Drilling down into the CSE, you can see the output referencing jetpackd.

At this point, on my new cluster, everything was green. No errors, so I waited for the Jetpack install error to come back. About 30 minutes in, it did. Totally confused, I started looking at the logs again. I browsed to the package install path on the node and noticed that the CSE had been replaced with a CSE I deploy automatically. I started thinking: is there some kind of dependency on the package the original CSE downloads onto the VM? Well, after reviewing the logs of my own CSE, guess what: it said Fatal error: Unrecognized OS: AlmaLinux. The Issues field in the CycleCloud UI reports the output of the CSE, so the Jetpack installation error is misleading; Jetpack installed just fine. It was CrowdStrike, which does not like AlmaLinux. I switched over to CentOS and the error went away. Long story short, if you deploy CSEs, know that CycleCloud will display the stderr if a problem happens and it'll show up as a Jetpack installation error.

Migrating a CentOS Virtual Appliance to Azure

I had an interesting ask to migrate an on-premises virtual appliance to Azure. I thought this would be straightforward, but here I am writing about why it wasn't, lol.

This particular appliance ran CentOS 7. The vendor said migrating to Azure is supported, and Microsoft has documentation on how to prep a VM for Azure here. I followed all the steps except deprovisioning, since I was migrating a single VM. Once I shut down the VM, I used PowerShell to upload the VHD. I recommend using PowerShell since it makes sure the virtual size of the VHD is aligned to 1 MiB; Azure Storage Explorer does not do this, FYI.

Connect-AzAccount -Environment AzureUSGovernment
Select-AzSubscription <sub>
$resourceGroup = 'rg-<project>'
New-AzResourceGroup $resourceGroup -Location 'usgovvirginia'
$path = 'C:\Temp\export\va\Virtual Hard Disks\va.vhd'
$location = 'usgovvirginia'
$name = 'va-os-disk'
Add-AzVhd -LocalFilePath $path -ResourceGroupName $resourceGroup -Location $location -DiskName $name -OsType Linux -Verbose
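
Once the upload finishes, attaching that managed disk as the OS disk of a new VM looks roughly like this; the NIC, VM name, and size are placeholders I am assuming, not details from the actual migration:

# the NIC is assumed to already exist; names and sizes below are placeholders
$disk  = Get-AzDisk -ResourceGroupName $resourceGroup -DiskName $name
$nic   = Get-AzNetworkInterface -ResourceGroupName $resourceGroup -Name 'va-nic'
$vmCfg = New-AzVMConfig -VMName 'va' -VMSize 'Standard_D2s_v3'
$vmCfg = Set-AzVMOSDisk -VM $vmCfg -ManagedDiskId $disk.Id -CreateOption Attach -Linux
$vmCfg = Add-AzVMNetworkInterface -VM $vmCfg -Id $nic.Id
New-AzVM -ResourceGroupName $resourceGroup -Location $location -VM $vmCfg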

After the upload completed and the VM was created, I tried to SSH into it and it was rejecting my credentials. I was confused, so I restored the VM into Hyper-V and I had the same issue.

Knowing it was an SSH issue, I looked at the sshd_config file and noticed PasswordAuthentication was set to no. The appliance had this set to yes, so something changed it during the VM prep.

After some trial and error, I narrowed it down to one of the prep steps.

Now that I knew which step it was happening in, I accessed the terminal instead of using my SSH client to change PasswordAuthentication back to yes, which fixed it. I re-uploaded the VHD and was able to SSH into the VM. But wait, that's not all.
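
Something along these lines from the terminal, assuming the stock /etc/ssh/sshd_config location:

sudo sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd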

The Linux agent was not in a ready state. I cat'ed /var/log/waagent.log and saw: Waiting for cloud-init to copy ovf-env.xml to /var/lib/waagent/ovf-env.xml

That file did not exist in /var/lib/waagent, and I was lost as to why. After reading a couple of open GitHub issues about this, I found a post https://thomasthornton.cloud/2020/04/16/cloud-init-does-not-appear-to-be-running-error-after-installing-walinuxagent/ describing the same issue. The fix is to manually create the ovf-env.xml file and edit the username and hostname. I did exactly that, restarted the waagent service, and everything automagically started working.
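
Restarting and checking the agent afterwards is just the following (the service is named waagent on CentOS; on some distros it is walinuxagent):

sudo systemctl restart waagent
sudo systemctl status waagent --no-pager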

Azure Windows Security Baseline

I was designing a deployment around Azure Virtual Desktop using Azure Active Directory (not AADDS or ADDS), and when checking a test deployment for compliance against the NIST 800-171 Azure Policy, it showed the Azure security baseline was not being met. In a domain, I wouldn't worry, since group policy would fix this right up, but what about non-domain-joined machines? What about custom images? Sure, we could manually set everything and then image it, but I prefer a clean base and then applying configuration during my image build. Let's take a look at how to hit this compliance checkbox.

I recalled that Microsoft released STIG templates and found the blog post Announcing Azure STIG solution templates to accelerate compliance for DoD – Azure Government (microsoft.com). I was hoping their efforts would make my life a little bit easier, but after a test deploy, I saw 33 items still not in compliance.

Looking at the workflow, it is close to how I'd like my image process to look in my pipeline.

Deploy a baseline image, apply some scripts, and then generate a custom image into a shared gallery for use. I didn't want to reinvent the wheel, so I researched whether anyone had done this already. I found a repo https://github.com/Cloudneeti/os-harderning-scripts/ that looked promising, but it was a year old and I noticed some things wrong with the script, such as incorrect registry paths and commented-out DSC snippets. It did a good bulk of the work, but needed cleanup and additions. The commented-out code was around user rights assignments. The DSC module for user rights assignments is old, and I haven't seen a commit there in years; playing around, it seems some settings cannot be set with it. I didn't want to hack something together using secedit, so I found a neat script https://blakedrumm.com/blog/set-and-check-user-rights-assignment/ where I could just pass in the required rights and move on. Everything worked except for SeDenyRemoteInteractiveLogonRight: when the right doesn't exist in the exported config, the script couldn't add it. So I wrote a snippet to add that last right.


$tempFolderPath = Join-Path $Env:Temp $(New-Guid)
New-Item -Type Directory -Path $tempFolderPath | Out-Null
secedit.exe /export /cfg $tempFolderPath\security-policy.inf


# get the line number of the [Privilege Rights] section header
$line = Select-String -LiteralPath "$tempFolderPath\security-policy.inf" -Pattern "Privilege Rights" |
    Select-Object -ExpandProperty LineNumber

# append the missing right under the [Privilege Rights] header (S-1-5-32-546 is the built-in Guests group)
$fileContent = Get-Content "$tempFolderPath\security-policy.inf"
$fileContent[$line-1] += "`nSeDenyRemoteInteractiveLogonRight = *S-1-5-32-546"
$fileContent | Out-File "$tempFolderPath\security-policy.inf" -Encoding unicode

secedit.exe /configure /db c:\windows\security\local.sdb /cfg "$tempFolderPath\security-policy.inf"
rm -force "$tempFolderPath\security-policy.inf" -confirm:$false

After running the PowerShell DSC configuration and the script above, the Azure baseline comes back fully compliant. I have tested this on Windows Server 2019 and Windows 10.
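
For anyone wiring this into a pipeline, applying a compiled DSC configuration locally looks roughly like this; the configuration and script names are placeholders, not necessarily what is in the repo:

# dot-source the script that defines the DSC configuration (placeholder name), compile it to MOFs, then push
. .\AzureWindowsBaseline.ps1
AzureWindowsBaseline -OutputPath .\mof
Start-DscConfiguration -Path .\mof -Wait -Verbose -Force
Test-DscConfiguration -Detailed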

You can grab the files in my repo https://github.com/jrudley/azurewindowsbaseline

AKS Double Encryption

I have been living in a world of compliance these past few weeks, specifically NIST 800-171. Azure provides a policy initiative for NIST, and one of the checks is to make sure your disks are encrypted with both a platform-managed and a customer-managed key. I recently ran into this with an application deployed as a StatefulSet in Azure Kubernetes Service. Let's talk a bit more about this and NIST.

The Azure Policy flagged my disks as non-compliant because they only had a platform-managed key. Researching the AKS docs, I found an article on using a customer-managed key, but that still is not what I need, since I need double encryption to meet compliance. After some research in the Kubernetes SIGs repo, I found the Azure Disk CSI driver doc, and check it out:

It looks like this document was modified back in May to add support for this feature, so it is fairly new. Upgrade the driver to 1.18 or above and double encryption support should be there.

To implement it, create a new storage class that references your disk encryption set ID with double encryption.

kind: StorageClass
apiVersion: storage.k8s.io/v1  
metadata:
  name: byok-double-encryption
provisioner: disk.csi.azure.com 
reclaimPolicy: Retain
allowVolumeExpansion: true
parameters:
  skuname: Premium_LRS
  kind: managed
  diskEncryptionType: EncryptionAtRestWithPlatformAndCustomerKeys
  diskEncryptionSetID: "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/diskEncryptionSets/<dek-name>"

Apply the snippet above and reference the storage class in your deployment YAML to get double encryption. This ticks that NIST compliance checkbox for AKS disks.
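
A minimal sketch of a PVC that consumes it (the claim name and size are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-byok-double-encryption
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: byok-double-encryption
  resources:
    requests:
      storage: 128Gi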

Missing Microsoft Applications in GCC High

An awesome feature that brings some sanity to Azure VM authentication and authorization is the Microsoft Azure Windows and Linux Virtual Machine Sign-in functionality. You can quickly test this by selecting the Login with Azure AD checkbox during provisioning.

I wanted to add MFA and user sign-in risk checks using Conditional Access before a user can actually log into the VM. When setting up my policy, I could not find the Microsoft Azure Windows Virtual Machine Sign-in or Microsoft Azure Linux Virtual Machine Sign-in app. I was puzzled, so I quickly checked my commercial tenant and sure enough they existed there. I initially thought it was one of those not-in-Gov-Cloud, commercial-only situations. I created a support ticket and they came back noting that they have seen Microsoft applications missing in GCC High tenants. The quick fix is to manually add the missing applications; since the application IDs are the same across clouds, we can quickly just create them.

New-AzureADServicePrincipal -AppId '372140e0-b3b7-4226-8ef9-d57986796201' #Microsoft Azure Windows Virtual Machine Sign-in
New-AzureADServicePrincipal -AppId 'ce6ff14a-7fdc-4685-bbe0-f6afdfcfa8e0' #Microsoft Azure Linux Virtual Machine Sign-In
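
If you are running this from a local session instead of cloud shell, connect to the right cloud first and then verify the service principal landed; a quick sketch using the same AzureAD module and the AppIds above:

Connect-AzureAD -AzureEnvironmentName AzureUSGovernment
Get-AzureADServicePrincipal -Filter "AppId eq '372140e0-b3b7-4226-8ef9-d57986796201'"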

After running those PowerShell cmdlets in my cloud shell, I can now see the apps when creating the Conditional Access policy.

Azure VM Applications

Azure has Template Specs, which let you create a self-service infrastructure-as-code model for your end users: you can use RBAC, and they can deploy versioned templates. Microsoft introduced VM Applications, which lets your end users do something very similar to Template Specs, but with applications installed inside the VM. Let's look at a quick demo and some things to watch out for.

Assuming you have an Azure compute gallery deployed, you need to create an application then a version of that application. I pasted a snippet below to get us started.

$applicationName = 'visualStudioCode-linux'
New-AzGalleryApplication `
  -ResourceGroupName $rgName `
  -GalleryName $galleryName `
  -Location $location `
  -Name $applicationName `
  -SupportedOSType Linux `
  -Description "Installs Visual Studio Code on Linux."
 
 
$version = '1.0.0'
New-AzGalleryApplicationVersion `
   -ResourceGroupName $rgName `
   -GalleryName $galleryName `
   -GalleryApplicationName $applicationName `
   -Name $version `
   -PackageFileLink $sasVscode `
   -Location $location `
   -Install "mv visualStudioCode-linux vscode.sh && bash vscode.sh install" `
   -Remove "bash vscode.sh remove" `
   -Update "mv visualStudioCode-linux vscode.sh && bash vscode.sh update" 

The cmdlet I want to focus on is New-AzGalleryApplicationVersion. The PackageFileLink parameter is required. Not only is it required, it must be a readable storage page blob, aka you cannot use a raw GitHub link to a file. I tried using a public repo for an install script, but when running this cmdlet, it just hangs. I will get to a workaround for that, but let's continue. The Install and Remove parameters are required; Update is optional. With that, I thought a simple framework could be used.

if [ "$1" == "install" ];
then
    echo "Installing...";
    <code>
elif [ "$1" == "remove" ];
then
    echo "Removing...";
    <code>
elif [ "$1" == "update" ];
then
    echo "Updating...";
    <code>
else
    echo "Incorrect argument passed. Please use install, remove or update";
fi

Now I can easily call the script with an argument of install, remove, or update. Reading about VS Code, we can use snap to handle the application installation.

if [ "$1" == "install" ];
then
    echo "Installing...";
    sudo snap install --classic code
elif [ "$1" == "remove" ];
then
    echo "Removing...";
    sudo snap remove code
elif [ "$1" == "update" ];
then
    echo "Updating...";
    sudo snap refresh --classic code
else
    echo "Incorrect argument passed. Please use install, remove or update";
fi

Now, going back to where I said there is a workaround for referencing a public repo: this is partially true. What you can do is reference a valid dummy file URL in the PackageFileLink parameter and then execute your commands directly in the Install, Update, and Remove parameters.

   -Install "apt-get update && apt-get install ubuntu-gnome-desktop xrdp gnome-shell-extensions -y && reboot" `
   -Remove "apt-get --purge remove ubuntu-gnome-desktop xrdp gnome-shell-extensions -y && reboot" 

A couple of other things to note. Once a user starts an application install, it executes fast, but the status reporting could use some work: if I am an end user in the portal and click install for a published application, I have to click back into the extension to see the status; it isn't shown in the VM Applications tab. Also, development is somewhat painful. If the install, update, or remove command fails, it seems to go into an endless loop that is a giant pain to stop; I couldn't figure it out. The documentation states to uninstall the extension, which I did, but it still keeps the app on the VM. It is still in preview, so I can't complain too much. Lastly, unlike Template Specs, which let you select a spec in another subscription in the portal, that option does not exist for VM Apps: you need a compute gallery in the same subscription for it to show up for the end user. This limitation does not exist when using Azure CLI, REST, or Az PowerShell, as long as you have the correct permissions to the compute gallery in the other subscription.
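
For that cross-subscription case via Az PowerShell, assigning a published application to a VM looks roughly like this; the target resource group and VM name are placeholders, and I am assuming the VM Applications cmdlets in recent Az.Compute releases:

# grab the published application version and add it to an existing VM
$appVersion = Get-AzGalleryApplicationVersion -ResourceGroupName $rgName -GalleryName $galleryName `
   -GalleryApplicationName $applicationName -Name $version
$vm = Get-AzVM -ResourceGroupName 'rg-target-vm' -Name 'dev-vm01'
$vm = Add-AzVmGalleryApplication -VM $vm -PackageReferenceId $appVersion.Id
Update-AzVM -ResourceGroupName 'rg-target-vm' -VM $vm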

I think VM Apps has a lot of potential to make the end-user experience better. Think of apps typically installed for developers to test with: we can now ensure approved and validated applications are installed, which I think is a great win!

Guest Configuration Extension Broke in Azure Gov for RHEL 8.x+

UPDATE 4/11/2022 This has been fixed!

UPDATE 4/3/2022 Still broken…waiting on the product group to fix.

UPDATE 3/4/2022 The Microsoft product group will be pushing a fix out to Azure Gov in two weeks. I asked what the cause was, but nothing yet.

One of the great features of Azure Policy is the ability to audit OS settings for security baselines and compliance checking. I was deploying RHEL 8.4 and noticed the guest assignment was always stuck in the pending state. I had no issues with Ubuntu, so it had to be something happening on the RHEL VM.

I navigated to /var/lib and saw the GuestConfig folder had been created, but inside, it was empty. Hrm, this should be populated with folders and MOF files.

[root@rhel84 GuestConfig]# pwd
/var/lib/GuestConfig
[root@rhel84 GuestConfig]# ls -al
total 4
drwxr--r--.  2 root root    6 Feb 26 22:13 .
drwxr-xr-x. 41 root root 4096 Feb 26 22:13 ..

The next step was to tail the messages log to see if anything could pinpoint what was actually happening.

[root@rhel84 GuestConfig]# tail -f /var/log/messages | grep -i GuestConfiguration
Feb 26 22:27:36 rhel84 systemd[7442]: gcd.service: Failed at step EXEC spawning /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service: Permission denied
Feb 26 22:27:46 rhel84 systemd[7458]: gcd.service: Failed at step EXEC spawning /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service: Permission denied

Alright, a permission denied error. It was something to start looking into, but I was confused why it was happening. I headed over to Azure commercial and spun up a RHEL 8.4 VM with the same Azure Policy executing my security baseline. Well, to my surprise, everything worked just fine. /var/lib/GuestConfig showed the Configuration folder with MOF files, and the guest assignment showed NonCompliant, so I knew it was OK there. I did notice the guest extension in commercial is using 1.26.24 and Gov is using 1.25.5. I tried deploying that version with auto-upgrade disabled in Gov, but got the same error.

After some research, I set SELinux to permissive mode, and instantly the Configuration folder was created and started pulling the MOF files down. OK, now I was really puzzled. Working with Azure support, they were able to reproduce the same issue in Gov, but not in commercial. I was shocked no other cases had been opened. I am not sure when this problem started, but it means security baselines on RHEL 8.x+ are not working.

While I wait for Microsoft to investigate why this is happening, I tried to find a workaround. Knowing SELinux is causing the issue, I thought I could just create a policy allowing the execution of gc_linux_service.

I tested first by making sure SELinux was set to Enforcing, then used chcon to set the SELinux context:

[root@rhel84 GuestConfig]# getenforce
Enforcing
chcon -t bin_t /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service

We're all good; no errors in the messages log. Since this could be reverted by a restorecon run later, I made it persistent in the SELinux policy by running:

semanage fcontext -a -t bin_t /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service
restorecon -v /var/lib/waagent/Microsoft.GuestConfiguration.ConfigurationforLinux-1.25.5/GCAgent/GC/gc_linux_service

I will update this post once Microsoft comes back with a reason why this is only happening in Azure Gov and what the proposed solution is. For now, I'd not depend on the guest extension to perform compliance checking on RHEL 8.x until a fix has been pushed.