Bug: OLVM 4.3.10 upgrade might paint you into a corner

Oracle Linux Virtualization Manager (OLVM) comes with a nice feature that regularly checks for updates and lets you deploy them to your cluster in a rolling fashion. Virtual machines are migrated away from one KVM host, which is then upgraded and rebooted. Afterwards, the previously migrated machines are moved back to the upgraded host and the procedure runs again for the next host.

Which is nice, as long as everything works out…

While upgrading to OLVM 4.3.10, the current version at the time of writing, something unpleasant happened: VMs could not be migrated back to the upgraded host. When I tried to migrate VMs manually, the host simply didn’t show up in the drop-down selection. Digging further, I found that VMs newly started on this host could not be migrated away from it either.

There were some VMs, however, that could be migrated in either direction. What they had in common was that they weren’t created with the “High Performance” profile (if you haven’t noticed this profile yet, check out this talk at FOSDEM 2019). One of the presets of this profile is “Pass-through CPU”, which allows guest OSes to make full use of the host CPU capabilities. If CPU types differ within your cluster, this might stop your VM from being migratable.

So I checked for any differences between the upgraded and the other KVM hosts and finally found all other hosts’ CPUs to be reported as “Intel Skylake Server Family” while the upgraded host’s CPU was reported as “Intel Skylake Server-noTSX-IBRS Family”.
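To make the effect easier to see, here is a minimal Python sketch of the comparison I did by hand. It is not OLVM code and doesn't talk to the engine: the host names are made up, the inventory is typed in manually, and the rule "a pass-through VM can only move between hosts that report the same CPU type" is a simplifying assumption based on what I observed, not the scheduler's actual logic.

```python
from collections import defaultdict


def group_hosts_by_cpu_type(host_cpu_types):
    """Map each reported CPU type to the sorted list of hosts reporting it."""
    groups = defaultdict(list)
    for host, cpu_type in host_cpu_types.items():
        groups[cpu_type].append(host)
    return {cpu_type: sorted(hosts) for cpu_type, hosts in groups.items()}


def migration_targets(source_host, host_cpu_types):
    """List hosts a pass-through-CPU VM could move to.

    Assumption (simplified): source and target must report the
    exact same CPU type string.
    """
    source_type = host_cpu_types[source_host]
    return sorted(
        host
        for host, cpu_type in host_cpu_types.items()
        if host != source_host and cpu_type == source_type
    )


# Hypothetical inventory: kvm01 has been upgraded, the others have not.
hosts = {
    "kvm01": "Intel Skylake Server-noTSX-IBRS Family",
    "kvm02": "Intel Skylake Server Family",
    "kvm03": "Intel Skylake Server Family",
}

for cpu_type, members in group_hosts_by_cpu_type(hosts).items():
    print(f"{cpu_type}: {', '.join(members)}")

# The upgraded host has no valid targets; the others still have each other.
print("from kvm01:", migration_targets("kvm01", hosts))
print("from kvm02:", migration_targets("kvm02", hosts))
```

With this inventory the upgraded host ends up in a group of its own, which matches the symptom: nothing with a pass-through CPU can move onto it or off it until a second host reports the same type.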

Bug Raised By Oracle Support

An SR with Oracle Support confirmed that before OLVM 4.3.10 some CPU types or families weren’t reported precisely enough (simply speaking). The “fix” in the current OLVM version in turn leads to unwanted differences in the reported CPU type between identical CPUs on hosts running different OLVM versions.

Effectively, this can paint you into a corner where upgraded hosts aren’t available to your cluster anymore, reducing its scalability and availability.

I’ll keep you updated when a response from the development team arrives for Bug 32587677. But I’m afraid there might be no workaround other than to stop VMs and restart them on an upgraded host.

tl;dr

When upgrading to OLVM 4.3.10, brace yourself for a downtime of your precious VMs.
