SharePoint UI Migration Bug

I am unsure if this is just a GCC High bug, but if you navigate to your SharePoint online admin portal and click on the migration link on the left side, you get an endless spinning icon.

My first thought was to check the dev tools to see if it was some how getting blocked, but all it showed was a status code of 200 doing a POST. Off to the AAD Sign-in Logs…

After filtering for my account, I saw a generic app is disabled message.

It didn’t dawn on me, but a co worker did see that the app was disabled in Enterprise Applications and under Microsoft applications. Thinking that was it, but nope, next error:

At this point, my gut was telling me something is wrong on Microsoft’s end. I popped a ticket and once they circled back to the SharePoint team, it is indeed a bug. So, how do you workaround this? Add /fileshare to the end of the URL.

After that, the Migration manager pulled up…

TIL: Conditional Access With Applications That Have Service Dependencies

I was testing a couple of conditional access policies against specific Office 365 cloud applications and came across a weird situation when doing a what-if analysis. Let’s take a simple scenario of blocking access to Exchange Online if the user is is not in our trusted IP list. Doing a what-if analysis, it works as expected when I test my account against that application.

Now, let’s setup a CAP targeting Teams and do a what-if analysis.

Both Exchange and Teams CAP will be applied, even though I am only targeting Teams. After digging more into this, there are service dependencies for conditional access based on the application being used. Looking at Teams, the user needs to satisfy access to SharePoint and Exchange before signing into Teams.

Trying to figure out why this happens lead me to https://learn.microsoft.com/en-us/azure/active-directory/conditional-access/service-dependencies which has the screenshot you saw above. They have this nifty tip 🙂

Most organizations will target the Office 365 app and not individual applications in their CAPs. Just keep this little tidbit in the back of your head not to get blindsided when doing CAP work. 🙂

PnP PowerShell and Azure functions assembly conflict

If you have worked with SharePoint Online, I am sure you are no stranger to PnP PowerShell module. It sits on top of graph, CSOM and the SharePoint Rest API which brings a lot of enhanced capabilities. Let’s look into an issue and how I solved it when it related to consuming this module in an Azure PowerShell Function.

There is a stable release at version 1.12.0 and a nightly build which is their preview at version 2.x.x-nightly. Typically, you would use the stable build, which is what I did in Azure. I loaded up that module in my requirements.psd1 file in Azure

I fired off a quick test where I connect to SharePoint online and I was presented with this:

A quick Google search led me to this page https://github.com/pnp/powershell/issues/2136

After reading the 80+ comments, everyone was using a consumption plan. There was a fix being rolled out which you could verify with a consumption plan as you can see the runtime version being used. How about if you were not on a consumption plan, such as I? With GCC High, we’re always last for everything and I just assumed the fix wasn’t rolled out. I did provision a quick Azure Function consumption plan and noticed my runtime was set to the version 14.15.1 that fixed this issue. I deployed my code on there and it still didn’t work, but once I changed my module version to a nightly build, it worked. OK, great the issue is fixed. Now, how do I fix this on my App Service Plan? I cannot see the runtime version. I initially thought, maybe the runtime version was upgraded and I just needed to set a nightly build version in my requirements.psd1. I did that, but no dice. I then remembered about an issue I had a few years ago when CSS told me that upgrading your plan to a Pv sku will force it to move off the existing host. I did that and sure enough, everything started working. I then changed my plan back to the B sku and it was put on a host that seemed to have the updated runtime as well.

I am not making assumptions, but it seems Microsoft rolled out the fix, but not to existing hosts that have sites deployed on it or they are still rolling out to GCC High. I wish there was some kind of blog or link in the portal that said they are rolling out X. Long story short, if you need an updated runtime, just upgrade your plan then downgrade back to your original.

Logic Apps – GCC High SharePoint Connector – Maury Povich Edition

If you came across this post because you are trying to use the SharePoint connector with Logic Apps in GCC High…let’s continue below…

https://learn.microsoft.com/en-us/connectors/connector-reference/connector-reference-logicapps-connectors shows that SharePoint is supported in Azure gov, but that turned out to not be true.

After spending too much time trying to figure out how to make the SharePoint Connector log into GCC High, I popped a support ticket. Microsoft quickly told me that it indeed does not work in GCC High, but should be available this year (2023). I had to laugh because the support rep said it would be quicker to implement the fix into GCC High than update the documentation. LOL, I’ve been working with Azure for so long that statement is 100% true.

Workaround? Power Automate. That stinks because it requires a license, but it is what it is.

Azure CycleCloud, Slurm and Star-CCM+ v17

I am a huge fan of Azure CycleCloud. It makes administration tasks so much easier. Now, there are some cons of running it in Gov Cloud, such as the core-hours does not work, but that isn’t much of a big deal if you deploy to a subscription and just calculate the costs from that. Let’s talk about using Slurm and Star-CCM. Microsoft published a whitepaper back in 2020 using a PBS cluster and version 14 of Star-CCM. I prefer Slurm and version 17 is out for Star-CCM, so let’s look at a more updated tutorial.

Microsoft recently updated CycleCloud to version 8.3 last month. You can find the releases notes here. Search the Azure Marketplace for CycleCloud and provision an 8.3 cluster. One thing to note and this sort of makes me question some default provisioning, but by default, it’ll give a public IP to your cluster. One of the mystery’s of CycleCloud is using this on an internal network, not exposed to the internet. Most orgs have a S2S vpn or express route into their vnet. For my demo, I am deploying into a spoke that has a peered hub connected back to on premise. When you provision the cluster, do not select a public IP and it’ll assign a private ip. More about this later when we deploy Slurm.

Once your cluster is up, navigate to https://<privateIp&gt; and add your name, ssh key and subscription details. These steps are documented here. Create a new cluster and select Slurm within the CycleCloud UI. I called my cluster slurm-test


For required settings, I gave my scheduler and login node a VM type of Standard_B4ms. For my HPC and HTC VM sku, I am using an HC44rs. Since this is 44 cores per vm, I wanted a total of 176 cores which would give me 4 vm’s that I can submit jobs to.

In Required Settings, make note of your Default NFS Share. I provisioned 256GB since I have a rough idea of what my resulting sim files will be. The thing to note is that I had one heck of a time trying to expand this disk once the scheduler was provisioned. When you terminate the scheduler, it does not delete this disk. When you go to start your scheduler back up, it re-attaches this disk. I could resize it in Azure, but AlmaLinux was not seeing the new size. Maybe I was just doing something wrong, but I resized numerous disks in RHEL and CentOS in Azure with no issue.

In the Advance settings screen is where I want to talk about private networking. Uncheck Name as Hostname. This will write all the hostnames into its hosts file locally, so all the compute nodes and scheduler can talk together. My Vnet is using custom DNS and not Azure, so these nodes need a way to resolve to each other, hence unchecking that checkbox. Also, uncheck all the Advanced Networking checkboxes. This cluster will only be accessible from on premise and not the internet.

In the cloud-init configuration, let’s add some things for the scheduler and HPC. I am not using HTC, login nodes or an HA scheduler, so I am skipping using cloud init there.

#cloud-config
runcmd:
   - yum install -y libSM libX11 libXext libXt libnsl.x86_64 git mpich
   - wget https://aka.ms/downloadazcopy-v10-linux
   - tar -xvf downloadazcopy-v10-linux
   - sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
   - echo "/dev/mapper/vg_cyclecloud_builtinsched-lv0   /sched   xfs defaults 0 0" >> /etc/fstab
   - echo "/dev/mapper/vg_cyclecloud_builtinshared-lv0  /shared  xfs defaults 0 0" >> /etc/fstab 
#cloud-config
runcmd:
   - yum install -y libSM libX11 libXext libXt libnsl.x86_64 git mpich

The initial cloud init installs some requires packages I need for StarCCM and pulling some git files down. The thing to note are the 2 lines that write into the /etc/fstab. This is a bug and no idea why this wasn’t fixed in the 8.3 release. If you don’t add these lines in, when rebooting the scheduler, you won’t be able to SSH to it. Microsoft confirmed this for me via a support case.

At this point, we are ready to start our Slurm cluster. Hit the Start button and give it about 5-10 mins to provision a VM and install the software.

Once the scheduler is running, highlight it and select Connect to get your SSH command given out to you prepopulated with your username and IP. Once connected, copy down your Star-CCM file. I am using a managed identity from my scheduler to connect to a storage account. In my cloud init script, I install azcopy to make life easier to get files on and off the scheduler.

Before I install Star-CCM, I navigated to /shared and created an apps folder to hold my Star-CCM install and a data folder to hold my job files. Assign permissions to those 2 folders with the account you plan to execute jobs under (If you plan to run under root, which you shouldn’t, you’ll need to add 2 env var’s which Slurm will complain and tell you after doing a sbatch) With that, we can then untar the file and run a single command to silently install.

 sudo ./STAR-CCM+17.04.008_01_linux-x86_64-2.17_gnu9.2-r8.sh -i silent -DPRODUCTEXCELLENCEPROGRAM=0 -DINSTALLDIR=/shared/apps -DINSTALLFLEX=false -DADDSYSTEMPATH=true -DNODOC=true

I have a license server already, so I am not installing it hence the installflex=false.

Alright, at this point, Star-CCM should be installed. Copy your .sim file to your scheduler. We need to create a sbatch script. I’ll provide a template here:

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=44
#SBATCH --output=output.%j.starccm=demo
#SBATCH --time=03:00:00
#SBATCH --job-name=demo_test
module purge
module load mpi

INSTALL_DIR="/shared/apps"
DATA_DIR="/shared/data"
CASE="test.sim"
STARCCM_VERSION="17.04.008-R8"

export PATH=$INSTALL_DIR/$STARCCM_VERSION/STAR-CCM+$STARCCM_VERSION/star/bin:$PATH

starccm+  -batch -power -mpi openmpi -bs slurm -licpath 1999@myLicenseServer.full.dns.name $DATA_DIR/$CASE

To sum it up, I am telling Slurm to run my submission on 1 node and use 44 cores available on that node. Give it a max runtime of 3 hours and call the job demo_test. The magic is executing starccm+ and specifying our batch system as slurm. That’ll take our #SBATCH settings above and use them.

Now, we can run a sbatch slurmtest.sh. We can check our job submission with squeue and see which node is being spin up. We can also check in the CycleCloud UI

Once the actual vmss instance is started, it’ll create a log file on the scheduler that we set in our sbatch output. If we cat that file, you can verify we have all 44 cores being used.

Once the job is finished, it’ll save it in the same location where the case file was. It’ll be displayed in the output file as well.

What is really cool is that once the job finishes, CycleCloud will spin down the HC44rs vmss instance it created, so we aren’t paying for these large HPC vm’s when not using it. StarCCM pairs really well with Slurm and CycleCloud. I highly recommend it!

AVD Scaling

I was doing a POC around AVD in Gcc High and wanted to implement autoscale. Reading Microsoft’s documentation, the new autoscale is not in GCC High. I had to either write my own, use some 3rd party solution or leverage their original logic app/runbook solution. I opted using Microsoft’s original solution, but of course, it seems this is no longer being updated. It uses RunAs accounts which MS will retire on 9/30/2023.

I don’t understand why they just don’t update the solution to use a managed identity, but oh well. Not knowing if their new autoscale will make it to GCC High, I updated it to do a couple of things worth sharing.

The first thing we need to do is deploy the logicapp/automation account solution. It will deploy an automation account, upgrade some modules and deploy a logic app. We will want to make sure the system managed identity is enabled on the automation account and assigned the contributor role to the subscription. Opening up the runbook deployed, find

$AzAuth = Connect-AzAccount -ApplicationId $ConnectionAsset.ApplicationId -CertificateThumbprint $ConnectionAsset.CertificateThumbprint -TenantId $ConnectionAsset.TenantId -SubscriptionId $ConnectionAsset.SubscriptionId -EnvironmentName $EnvironmentName -ServicePrincipal

and replace it with

$AzAuth = Connect-AzAccount -Identity 

add -EnvironmentName AzureUSGovernment if you are hitting GCC High.

I commented out $ConnectionAsset = Get-AutomationConnection -Name $ConnectionAssetName as we aren’t using a runas account anymore. In your logic app, you can just make the request parameter empty “ConnectionAssetName”: “”

At this point, we’re using a managed identity to log in. Great, but let’s start thinking why we are using this solution. Yes, we need more VMs to satisfy user load, but it is also a cost savings tool. If the VMs will shut down at night, why pay for Premium_LRS disks? We can easily add a function that converts the disk at shutdown and startup.

I added a simple function to the runbook:

function Set-AvdDisk {
    param (
        [string]$rgName,
        $vm,
        [ValidateSet("Standard_LRS", "Premium_LRS")]
        [string]$diskSku
    )

    if ($convertDisks) {	
        if ($vm.PowerState -eq 'VM deallocated') {
            write-output "VM $($vm.Name) is deallocated, checking disk sku"
            $vmDisk = get-azdisk -ResourceGroupName $rgname -DiskName $vm.StorageProfile.OSDisk.Name
            if ($vmDisk.sku.name -ne $diskSku) {
                write-output "Changing disk sku to $diskSku on VM $($vm.Name)"
                $vmDisk.sku = [microsoft.Azure.Management.Compute.Models.DiskSku]::new($diskSku)
                $vmDisk | Update-AzDisk 
            }
        }
    }
}

I just call that function in the foreach loop when the runbook starts a VM up:

That will ensure the disk is Premium sku when starting the VM up, but what about shutdown? That is the real cost saving. At the end of the script when all jobs are completed, I just run a simple powershell script that pulls back all the deallocated vm’s and runs the function above to convert the disks back to Standard_LRS.

    Write-Log 'Convert all deallocated VMs disk to Standard Sku'
    $stoppedVms = Get-AzVm -ResourceGroupName $ResourceGroupName -status | where {$_.PowerState -eq 'VM deallocated'}
    foreach ($vm in $stoppedVms) {
        Set-AvdDisk -rgName $vm.resourcegroupname -vm $vm -diskSku 'Standard_LRS'
    }

When the runbook starts and a scale up or down action hits, you can see the result in the output of the job. The screenshot below is turning on a VM. It will change the disk sku back to Premium_LRS for that specific vm before starting it up.

Here is a screenshot of a VM being deallocated which you can see the disk being set to Standard_LRS to reduce cost.

Feel free to modify the script. I’ll honestly say that it took me 5 mins to put this together, so there is room for improvement, but this shows that the functionality will work. I’ll eventually get around to making this better, things like tag support to skip disks for some reason, but for now, enjoy.

Disclaimer: Test this before implementing.

https://github.com/jrudley/avd/blob/main/WVDAutoScaleRunbookARMBased.ps1

Missing MSI when using the Microsoft HPC Pack in CycleCloud 8.2

I had a project that required HPC pack, so I went right to CycleCloud to provision my new cluster. When I came to the Secrets and Certificate section, I was going to use a Key Vault to store my certificate and password. After selecting the MSI Identity dropdown, it was empty.

I started looking in the documentation and found https://learn.microsoft.com/en-us/azure/cyclecloud/hpcpack?view=cyclecloud-8#azure-user-assigned-managed-identity which stated you must use a user assigned managed identity with GET for Secret and Certificate. I had that assigned already and still no dice.

Using KeyVault is a hard requirement as I don’t want to be passing PFX and passwords into CycleCloud. After unsuccessfully trying to figure out why my MSI is MIA, I popped a support ticket. After explaining the situation, the rep was able to reproduce it, but eventually came back with a solution. CycleCloud runs a job every hour that will discover new Azure resources. We can force this update by ssh’ing into the CycleCloud node and as root run the following command:

/opt/cycle_server/cycle_server  run_action Run:Application.Timer -eq Name plugin.azure.monitor_reference

Select your Subscription in the CycleCloud UI and click the Tasks tab. You will see a task running that is collecting reference data from Azure.

Once this task is finished, navigate back to creating the HPC Pack cluster and the dropdown should populate the managed identity.

Either have patience or run the command above to force the discovery. 🙂

The Mysterious Case of Azure CycleCloud Jetpack Install Error

I’ve been doing a lot of HPC work using Azure CycleCloud. It can quickly deploy an entire HPC cluster in minutes with the benefit of autoscale on the compute node side. I will have a couple of posts about some things that gave me grey hair, but let’s first look at the Jetpack installation error that CycleCloud kept showing me.

I know that CycleCloud makes use of Jetpack after doing some research and was totally confused why Jetpack would be throwing an error about Alma Linux not being supported when it is the default image used for a Slurm deployment. I was trying to find some pattern, so I would delete the Slurm cluster and reprovision it. I SSH’ed into the node and verified jetpack installed just fine from the /opt/cycle logs dir. I started looking around and noticed a custom script extension is used to start the install of something.

Drilling down into the cse, you can see the output is referencing jetpackd.

At this point, on my new cluster, everything is green. No errors, so I waited for it to throw the jetpack install error again. About 30 minutes in, the error came back. Totally confused, I started looking at the logs again. I browsed to the package install path on the node and noticed that the CSE was replaced with a CSE I automatically deploy. I started thinking, is there some kind of dependency on the package the CSE initially downloads onto the VM? Well, after reviewing the logs of my CSE that is deployed, guess what, it said Fatal error: Unrecognized OS: AlmaLinux. The The issues field in the CycleCloud UI is reporting the output of the CSE. Jetpack installation error is misleading because Jetpack installed just fine. It was Crowdstrike which does not like Alma Linux. I switched over to CentOS and the error went away. Long story short, if you deploy CSE’s, know that CycleCloud will display the stderr if a problem happens and it’ll fall under a Jetpack installation error.

Migrating a CentOS Virtual Appliance to Azure

I had an interesting ask to migrate an on premise virtual appliance to Azure. I thought this would be straight forward, but here I am writing why this situation wasn’t, lol.

This particular operating system was CentOS 7. The vendor said it is supported to migrate to Azure and I found that Microsoft has documentation on how to prep a VM for Azure here. I followed all the steps except deprovisioning as I was migrating a single VM. Once I shutdown the VM, I used PowerShell to upload the vhd. I recommend using PowerShell since it handles making sure the virtual size of the vhd is aligned to 1 MiB. Azure Storage Explorer does not do this, fyi.

connect-azaccount -Environment azureusgovernment
select-azsubscription <sub>
$resourceGroup = 'rg-<project>'
new-azresourcegroup $resourceGroup -location 'usgovvirginia'
$path = 'C:\Temp\export\va\Virtual Hard Disks\va.vhd'
$location = 'usgovvirginia'
$name = 'va-os-disk'
Add-AzVhd -LocalFilePath $path -ResourceGroupName $resourceGroup -Location $location -DiskName $name -OsType Linux -Verbose 

After running the above command, the disk will be uploaded and you can create a new VM off of it. Once I did all of that, I tried to SSH into the VM and it was rejecting my credentials. I was confused, so I restored the VM into Hyper-V and I had the same issue.

Knowing it is a SSH issue, I looked at the sshd_config file and noticed PasswordAuth was set to no. The appliance had this set to Yes, so something changed it during VM prep.

After some trial and error, I narrowed it down to this part:

Now that I know what step it is happening, I accessed the terminal instead of using my ssh client to change passwordAuthentication back to yes which fixed it. Re-uploaded and I was able to SSH into the VM. But wait, that’s not all.

The Linux agent was not in a ready state. I cat /var/log/waagent.log and saw Waiting for cloud-init to copy ovf-env.xml to /var/lib/waagent/ovf-env.xml

That file did not exist in /var/lib/waagent and was lost why it wasn’t there. After reading a couple open GitHub issues about this, I found a post https://thomasthornton.cloud/2020/04/16/cloud-init-does-not-appear-to-be-running-error-after-installing-walinuxagent/ that had the same issue I had. The fix is just to manually create the ovf-env.xml file and edit the username and hostname. I did exactly that, restarted the waagent service and everything automagically started working.

Azure Windows Security Baseline

I was designing a deployment around Azure Virtual Desktop utilizing Azure Active Directory, not AADDS or ADDS and when checking a test deploy for compliance against the NIST 800-171 Azure Policy, it showed the Azure Baseline is not being met. In a domain, I wouldn’t worry since group policy will fix this right up, but what about non domain join? What about custom images? Yeah, I guess we could manually set everything then image it, but I prefer a clean base then apply configuration during my image build. Let’s take a look how to hit this compliance checkbox.

I recalled that Microsoft released STIG templates and found the blog post Announcing Azure STIG solution templates to accelerate compliance for DoD – Azure Government (microsoft.com). I was hoping their efforts would make my life a little bit easier, but after a test deploy, I saw 33 items still not in compliance.

Looking at the workflow, it is ideally how i’d like my image process to look in my pipeline.

Deploy a baseline image, apply some scripts and then I can generate a custom image to a shared gallery for use. I didn’t want to reinvent the wheel, so I started researching if anyone has done this already. I found a repo https://github.com/Cloudneeti/os-harderning-scripts/ that looked promising, but it was a year old and I noticed some things incorrect with the script such as incorrect registry paths, commented out DSC snippets, etc. This did do a good bulk, but just needed cleaned up and things added. Looking at the commented code, it was around user rights assignments. Now, the DSC module for user right assessments is old and I haven’t seen a commit in there for years. Playing around, it seems that some settings can not be set. I didn’t want to hack together stuff using secedit, so I found a neat script https://blakedrumm.com/blog/set-and-check-user-rights-assignment/ that I could just pass in the required rights and move on. Everything worked except for SeDenyRemoteInteractiveLogonRight. When the right doesn’t exist in the exported config, it couldn’t add it. So, I just wrote the snippet to add the last right.


$tempFolderPath = Join-Path $Env:Temp $(New-Guid)
New-Item -Type Directory -Path $tempFolderPath | Out-Null
secedit.exe /export /cfg $tempFolderPath\security-policy.inf


#get line number
$file = gci -literalpath "$tempFolderPath\security-policy.inf" -rec | % {
$line = Select-String -literalpath $_.fullname -pattern "Privilege Rights" | select -ExpandProperty LineNumber
}

#add string
$fileContent = Get-Content "$tempFolderPath\security-policy.inf"
$fileContent[$line-1] += "`nSeDenyRemoteInteractiveLogonRight = *S-1-5-32-546"
$fileContent | out-file "$tempFolderPath\security-policy.inf" -Encoding unicode

secedit.exe /configure /db c:\windows\security\local.sdb /cfg "$tempFolderPath\security-policy.inf"
rm -force "$tempFolderPath\security-policy.inf" -confirm:$false

After running PowerShell DSC and script, the Azure baseline comes back fully compliant. I have tested this on Windows Server 2019 and Windows 10.

You can grab the files in my repo https://github.com/jrudley/azurewindowsbaseline