Creating Best Case Conditions for Noisy VM Migrations with HCX

HCX has been used to migrate tens of thousands of applications across a varied virtual machine landscape and across business verticals, which often includes noisy “high churn” virtual machines. For this post, we’ll define these as highly active virtual machines that create very large datasets over a fixed interval (for example, databases, large scale syslog collectors, and data warehouse systems).

Image: brilliant.org.

“In a data transfer, we cannot exceed the limitations of each performer in the system.”

– Gabe
(An individual who does not claim any expertise in things, take his comments here, or on any topic anywhere with a grain of salt. 🤠 )

Noisy virtual machines can overwhelm the end-to-end resources available for migrations and, depending on conditions, can prevent a live migration (HCX vMotion or RAV) from completing the switchover.

In some cases, environmental adjustments are enough to make the transfer achievable. In other cases, it will be necessary to take a low downtime approach. These two cases are the subject matter of this post.

Identifying Noisy VMs During the HCX Operation

In an ideal scenario, the migration admin or project planner is aware of potentially noisy workloads and is planning for them. In other cases, the virtual machine characteristics have changed, or the virtual machine was not previously identified as a high churn candidate.

HCX added “High Activity VM detection” in the late May HCX release to help migration admins learn about noisy workloads as the migration is happening. HCX users see the high churn alerts and can make adjustments as needed to avoid prolonged, and possibly failed, switchover events.

This detection feature uses VMware vCenter Server heuristics and internal Recovery Point Objective data to detect high-activity virtual machines early, calculate the impact on migrations, and generate Critical Alerts for at-risk migrations. HCX Alerts are generated when high churn conditions begin and when they end:

VM – High churn detected
VM – High churn stopped
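
To make the detected/stopped alert pair concrete, here is a minimal sketch of a start/stop detector with hysteresis. This is not HCX's actual heuristic, which the post does not describe in detail. The function name `churn_events`, the sample values, and both threshold ratios are assumptions chosen purely for illustration.

```python
from typing import Iterable, List, Tuple


def churn_events(
    samples_mb_s: Iterable[float],
    transfer_mb_s: float,
    start_ratio: float = 0.8,   # hypothetical threshold, not an HCX setting
    stop_ratio: float = 0.5,    # hypothetical threshold, not an HCX setting
) -> List[Tuple[int, str]]:
    """Emit (sample_index, alert) pairs as the churn rate crosses thresholds.

    A "detected" alert fires when the write rate reaches start_ratio of the
    available transfer rate. A "stopped" alert fires only once the rate drops
    to stop_ratio or below, so small fluctuations do not flap the alert.
    """
    if transfer_mb_s <= 0:
        raise ValueError("transfer_mb_s must be positive")

    events: List[Tuple[int, str]] = []
    high = False
    for index, rate in enumerate(samples_mb_s):
        ratio = rate / transfer_mb_s
        if not high and ratio >= start_ratio:
            high = True
            events.append((index, "High churn detected"))
        elif high and ratio <= stop_ratio:
            high = False
            events.append((index, "High churn stopped"))
    return events


if __name__ == "__main__":
    # Made-up write rates (MB/s) against a 100 MB/s transfer path.
    for index, alert in churn_events([10, 90, 70, 40, 85], transfer_mb_s=100):
        print(index, alert)
```

The gap between the two thresholds is the design choice worth noticing: a single threshold would toggle the alert every time the workload hovered around it.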

Creating the Best Case Conditions for a Best Effort Live Migration Attempt

Even if you replace a rough dirt road with a multi-lane highway, there’s only so fast a person on a bike can ride that path. Similarly in a data transfer, we cannot exceed the limitations of each performer in that system. Depending on the environment conditions and the overall data rate of change, it may not be possible to live migrate certain noisy virtual machines. An alternate low downtime approach can be used instead (covered in the next section).
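
The bike-and-highway point can be put into rough numbers. The sketch below is a simplified model of an iterative copy, not a description of how HCX or vMotion schedule their passes: each pass copies whatever is outstanding, and the data written during that pass becomes the next pass's workload. The function name, the cutoff value, and the example figures are all assumptions for illustration.

```python
from typing import Optional, Tuple


def passes_to_converge(
    disk_gb: float,
    transfer_mb_s: float,
    churn_mb_s: float,
    cutoff_mb: float = 512.0,   # hypothetical "small enough to switch over" size
    max_passes: int = 100,
) -> Optional[Tuple[int, float]]:
    """Return (passes, total_seconds) until the outstanding data fits the cutoff.

    Returns None when the copy cannot converge, which in this model is any
    time the churn rate is at least as large as the transfer rate.
    """
    if transfer_mb_s <= 0:
        raise ValueError("transfer_mb_s must be positive")
    if churn_mb_s >= transfer_mb_s:
        return None  # the rider can never catch up, however wide the road

    remaining_mb = disk_gb * 1024.0
    total_seconds = 0.0
    for passes in range(1, max_passes + 1):
        pass_seconds = remaining_mb / transfer_mb_s
        total_seconds += pass_seconds
        # Data dirtied while this pass was copying is the next pass's work.
        remaining_mb = churn_mb_s * pass_seconds
        if remaining_mb <= cutoff_mb:
            return passes, total_seconds
    return None


if __name__ == "__main__":
    print(passes_to_converge(disk_gb=1000, transfer_mb_s=100, churn_mb_s=50))
    print(passes_to_converge(disk_gb=1000, transfer_mb_s=100, churn_mb_s=120))
```

In this model the outstanding data shrinks by the ratio of churn to transfer rate on every pass, so the adjustments that follow help in one of two ways: they raise the effective transfer rate, or they lower the churn during the attempt.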

With that said, the noisy VM may be critical enough that a scheduled business maintenance window is very difficult to obtain, so it is worth making adjustments for a best effort live migration attempt.

Some things to try for best case migration conditions:

  • Place the source migration appliance (HCX-IX-I) in the same ESXi host that is running the Noisy VM
  • Place the destination migration appliance (HCX-IX-R) in the ESXi host that will receive the VM.
  • Ensure the HCX-IX is not resource constrained
    • If needed, use the Compute Profile to set CPU / MEM reservations.
    • Deploy using fast storage.
    • If needed, reduce the VM density of the migration ESXi hosts. 
  • Ensure that source and destination ESXi hosts for the migration are not running other vMotions within the cluster
  • Make sure Fully Automated DRS is not running in the source and destination clusters.
  • During the noisy VM migration, dedicate the HCX-IX to that purpose.
  • Attempt the switchover during off-peak hours, when network congestion is reduced.
  • Ensure the source and destination environments are following VMware vMotion Best Practices.
  • If the noisy virtual machine shares its datastore with other virtual machines, use Storage vMotion to move those other virtual machines to other local datastores.

Low Downtime Migration of Noisy VMs

Depending on the environment conditions and the overall data rate of change, it may not be possible to use the live migration options with certain noisy virtual machines.

A low downtime approach always works, and can be used to avoid the very large outage that comes with a cold migration of these large noisy VMs. In a cold migration, the virtual machine is powered off for the full duration of all phases of the migration (HCX can do that too).

The low downtime approach:

  1. Configure a scheduled replication-based migration transfer (HCX Bulk or RAV migration types).
  2. Assign a switchover window with enough time for HCX to complete the transfer for the noisy virtual machine’s disks.
  3. Allow the migration operation to reach a “waiting for switchover/delta sync” state.
  4. Schedule a business maintenance window to switchover the virtual machine, and perform the following steps during the maintenance window:
    1. With RAV migration (Guest OS remains online, primary service is stopped):
      1. Reduce or halt the services that generate disk changes (by stopping only the application services on the powered-on virtual machine). HCX Alerts should display the high data churn stopped state.
      2. Remove the configured RAV migration schedule to start the switchover immediately.
      3. Once the virtual machine has finished its live migration to the destination vCenter Server, start the services.
    2. With Bulk Migration (Guest OS is powered off, and restarted at the destination):
      1. Remove the configured Bulk migration schedule to start the switchover immediately.
      2. The source virtual machine will be powered off and the destination virtual machine will be powered on.
      3. Allow the OS to initialize all services.
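
For step 2 of the procedure above, a back-of-the-envelope estimate can help pick a switchover window that leaves HCX enough time for the initial transfer. The sketch below is only arithmetic under assumed numbers. The function name and the safety factor are hypothetical, and the sustained transfer rate has to come from your own environment rather than from anything stated in this post.

```python
from datetime import datetime, timedelta


def earliest_switchover(
    replication_start: datetime,
    disk_gb: float,
    transfer_mb_s: float,
    safety_factor: float = 1.5,   # hypothetical padding for churn and contention
) -> datetime:
    """Estimate the earliest sensible start of the switchover window.

    The base transfer time is disk size divided by the sustained transfer
    rate. The safety factor pads that estimate because a noisy VM keeps
    generating changes while the initial sync is running.
    """
    if transfer_mb_s <= 0:
        raise ValueError("transfer_mb_s must be positive")
    if safety_factor < 1.0:
        raise ValueError("safety_factor must be at least 1.0")

    base_seconds = disk_gb * 1024.0 / transfer_mb_s
    return replication_start + timedelta(seconds=base_seconds * safety_factor)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 0, 0)
    print(earliest_switchover(start, disk_gb=1000, transfer_mb_s=100))
```

Scheduling the window later than this estimate is harmless in this approach, since the migration simply waits in the delta sync state until the schedule is removed or the window opens.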

~

To summarize

HCX detects virtual machines in a high data churn state. This can help avoid migration surprises.

The migration of noisy virtual machines should be planned. Sometimes it is possible to live migrate the noisy VM when the best conditions are created for the transfer, using the suggestions listed above.

A planned, low downtime approach is always possible:
– RAV can be used with some manual intervention to reduce service downtime (OS stays online)
– HCX Bulk Migration can be used without intervention with reboot downtime (OS is rebooted)

That’s all I had to say about that! I wish you all a very happy Fall Season. Stay safe!

Gabe
