VM Broken? Use OS disk swap in Azure to fix and restore

I can’t begin to count the number of times I’ve broken a virtual machine so badly that it wouldn’t boot anymore. In most cases, that was due to messing up /etc/fstab, which controls which file systems get mounted in Linux. If that file is broken, your machine might not boot anymore.

In Azure, you can swap the OS disk of a VM. That means you can swap in a working disk to fix the issues with your VM. I hope you never need to use it, but if you need it, it’s there for you.

In this post, I want to quickly explain how to do this and how you can use OS disk swap to get your VM back to a working state.

Process for fixing issues

Let’s walk through the process I typically use in cases where I need to swap the OS drive:

  1. Take a snapshot of the broken OS disk.
  2. Create a new disk based on this snapshot.
  3. Attach the new disk to a working VM.
  4. Fix the issues from within the working VM.
  5. Detach the disk.
  6. Swap the OS disk.

In summary, I create a copy of the disk, fix the issues and then swap the OS drive. You cannot detach the OS drive from a VM, since a VM always needs to have an OS drive (except for VMs using ephemeral storage).

With the process covered, let’s have a look at a working example:

Fixing a broken VM using OS disk swap

Breaking my VM

For my test setup, I pre-created two VMs running Ubuntu 18.04. Let’s log in to the VM we plan to break, and make a change in /etc/fstab. I’ll remove the last digit of the ID of the boot disk.

I then rebooted the VM (in my case, from the Azure portal), which should cause the disk failure to present itself. I wasn’t expecting the VM to actually boot, but it did, and I could still connect to it using SSH. While working on the VM, however, I ran into many issues. One of them was that I couldn’t restore the fstab file: apparently the file system was mounted read-only, not read-write.
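
A quick way to confirm a read-only mount is to look at the mount options. The sketch below assumes the affected file system is the root file system (the post does not say which one it was) and that `findmnt` from util-linux is available. The helper names `is_readonly` and `check_root_mount` are hypothetical.

```shell
# Succeeds (exit code 0) when a comma-separated mount option list contains "ro".
# is_readonly is a hypothetical helper name, not a standard command.
is_readonly() {
  case ",$1," in
    *,ro,*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Report whether the root file system is currently mounted read-only.
# Assumption: the root file system is the one that is affected.
check_root_mount() {
  opts="$(findmnt -n -o OPTIONS /)"
  if is_readonly "$opts"; then
    echo "root file system is read-only: $opts"
  else
    echo "root file system is read-write: $opts"
  fi
}
```

Running `check_root_mount` on the broken VM would then print which of the two states applies.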

Also, trying to run sudo apt update resulted in many errors:

In summary, not the issue I was expecting, but broken nonetheless. Let’s go ahead and fix it!

Take a snapshot of the broken OS disk

The first step is to take a snapshot of the OS disk. This can be done either via PowerShell/CLI or in the portal. For the purpose of this post, I’ll use the portal route.

To take the snapshot, first navigate to your broken OS disk. You can get there via the Disks part of the VM blade:

Go to the OS disk.

In the resulting disk blade, you’ll see the ‘Create snapshot’ option. Click that.

Create a snapshot

In the next blade, provide the necessary details for the snapshot and hit Review and Create.

Provide details for the snapshot
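
For those who prefer the CLI route, the same snapshot can be sketched with the Azure CLI. This assumes `az` is installed and logged in and that the VM uses a managed OS disk. The function name and its three arguments (resource group, VM name, snapshot name) are hypothetical placeholders.

```shell
# Sketch: snapshot the OS disk of a VM with the Azure CLI.
# Arguments (all supplied by you): resource group, VM name, snapshot name.
snapshot_os_disk() {
  rg="$1"; vm="$2"; snap="$3"
  # Look up the resource ID of the VM's managed OS disk.
  disk_id="$(az vm show --resource-group "$rg" --name "$vm" \
    --query "storageProfile.osDisk.managedDisk.id" --output tsv)"
  # Create the snapshot from that disk.
  az snapshot create --resource-group "$rg" --name "$snap" --source "$disk_id"
}
```

A call would look like `snapshot_os_disk my-rg broken-vm broken-vm-snapshot`, with all three names replaced by your own.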

Create a new disk based on this snapshot

Next up, we’ll create a disk from this snapshot. This cannot be done from the snapshot blade itself; you’ll have to navigate to the disks blade for this. To get there, just type disks in the search bar.

Look for the disks blade

Once there, hit the Add button on the top.

Create a new disk.

What you’ll need to do here is set the source type to snapshot, and refer to the snapshot we just created.

Create a new disk based on the snapshot.

Hit create, and give it some time to create. Then we can attach it to the working VM.
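
The CLI equivalent is a single command. A minimal sketch, assuming the snapshot lives in the same resource group as the new disk (so it can be referenced by name). The function name and arguments are hypothetical.

```shell
# Sketch: create a managed disk from an existing snapshot.
# Arguments: resource group, snapshot name, name for the new disk.
disk_from_snapshot() {
  rg="$1"; snap="$2"; disk="$3"
  # --source accepts a snapshot name (same resource group) or a full resource ID.
  az disk create --resource-group "$rg" --name "$disk" --source "$snap"
}
```

For example: `disk_from_snapshot my-rg broken-vm-snapshot fixed-os-disk`.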

Attach new disk to a working VM

In the VM blade of the VM that you’ll use to fix this issue, navigate to the disks blade and hit the “Attach existing disks” button. Look for the disk we just created, and hit the save button.

Attach the existing disk, and hit the save button.
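
Attaching the disk can also be done from the Azure CLI instead of the portal. The sketch below assumes the disk and the working VM are in the same resource group. The function name and arguments are hypothetical.

```shell
# Sketch: attach an existing managed disk as a data disk to the working VM.
# Arguments: resource group, name of the working VM, name of the disk.
attach_disk_to_vm() {
  rg="$1"; vm="$2"; disk="$3"
  az vm disk attach --resource-group "$rg" --vm-name "$vm" --name "$disk"
}
```

For example: `attach_disk_to_vm my-rg working-vm fixed-os-disk`.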

Give it a couple of seconds to attach the disk. Once the disk is attached, connect to your VM, and mount the new disk. The commands I typically use for this:

# List all disks and partitions; look for the /dev/sdX of the new disk.
sudo fdisk -l
# Create a mount point and mount the first partition of that disk.
sudo mkdir -p /mnt/broken-disk
sudo mount /dev/sdX1 /mnt/broken-disk
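
If the fdisk output is hard to scan, listing only the partitions that are not mounted anywhere narrows the candidates. This is a sketch, assuming `lsblk` and `awk` are available. The helper name `unmounted_partitions` is hypothetical, and it may print more than one partition, so you still have to pick the right one by size.

```shell
# Reads lines in the format produced by `lsblk -rno NAME,TYPE,MOUNTPOINT`
# from stdin and prints the device path of every partition that has no
# mount point. unmounted_partitions is a hypothetical helper name.
unmounted_partitions() {
  awk '$2 == "part" && $3 == "" { print "/dev/" $1 }'
}
```

Typical usage would be `lsblk -rno NAME,TYPE,MOUNTPOINT | unmounted_partitions`.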

And with the mount done, we can now fix the issue.

Fix issues in working VM

Fixing the issue in this case is as easy as adding the digit back into the /etc/fstab file. In this case, that file will be at location /mnt/broken-disk/etc/fstab.

I added the ‘3’ back in, and saved the updated file. Let’s move to the next step.
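
To double-check the edit, you can compare the ID in the repaired file with the one the partition actually reports. A minimal sketch, assuming the fstab entry uses the `UUID=` form. The helper name `fstab_has_uuid` is hypothetical.

```shell
# Succeeds when the given fstab file contains an entry for the given UUID.
# Arguments: path to the fstab file, file system UUID to look for.
fstab_has_uuid() {
  fstab="$1"; uuid="$2"
  # Require whitespace after the UUID so a truncated ID does not match.
  grep -q "UUID=${uuid}[[:space:]]" "$fstab"
}
```

With the disk still mounted, this could be used as `fstab_has_uuid /mnt/broken-disk/etc/fstab "$(sudo blkid -s UUID -o value /dev/sdX1)" && echo "fstab matches"`, where /dev/sdX1 is the partition you mounted earlier.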

Detach disk

To detach the disk, navigate back to the disk blade on the fixing VM. On the line of the disk we just attached, navigate to the right part of the screen, hit the ‘X’ and hit the save button.

Detach the disk

With the disk detached, we can now swap the OS drive.
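
The detach step also has a CLI equivalent. A sketch, assuming the disk and the fixing VM share a resource group. The function name and arguments are hypothetical.

```shell
# Sketch: detach a data disk from the VM that was used for the repair.
# Arguments: resource group, name of the fixing VM, name of the disk.
detach_repaired_disk() {
  rg="$1"; vm="$2"; disk="$3"
  az vm disk detach --resource-group "$rg" --vm-name "$vm" --name "$disk"
}
```

Whichever route you take, it is a good idea to run `sudo umount /mnt/broken-disk` inside the fixing VM before detaching, so that all changes are flushed to the disk.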

Swap OS disk

Navigate to your broken VM, open the disks section of the VM blade and hit the Swap OS disk button.

Hit the Swap OS disk button

Pick the disk you want to swap in, and confirm the action by typing in the VM name.

Interestingly enough, hitting the OK button causes Azure to stop your VM first. I was wondering whether this would produce an error because my VM was still running, but Azure was smart enough to stop the VM first.

Azure will stop the VM if it is still running.

Once the VM is stopped, this will trigger the OS disk swap:

After the VM is stopped, the swap will happen.

It doesn’t trigger an automatic start of the VM, so hit the start button to see if our issues have been fixed. And in my case – as expected – everything worked normally again. I could do my apt-get update without issue (where it failed earlier).

Everything working fine again.
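
The swap itself, including the stop and the start, can be sketched with the Azure CLI as well. This assumes the repaired disk is in the same resource group as the broken VM. Unlike the portal, the sketch deallocates the VM explicitly before the swap; that is a cautious choice on my part, not something the portal steps above require. The function name and arguments are hypothetical.

```shell
# Sketch: swap the OS disk of a VM for a repaired managed disk.
# Arguments: resource group, name of the broken VM, name of the repaired disk.
swap_os_disk() {
  rg="$1"; vm="$2"; disk="$3"
  # Resolve the repaired disk's resource ID.
  disk_id="$(az disk show --resource-group "$rg" --name "$disk" \
    --query id --output tsv)"
  # Stop and deallocate the VM before touching its OS disk.
  az vm deallocate --resource-group "$rg" --name "$vm"
  # Point the VM at the repaired disk.
  az vm update --resource-group "$rg" --name "$vm" --os-disk "$disk_id"
  # The swap does not start the VM, so start it explicitly.
  az vm start --resource-group "$rg" --name "$vm"
}
```

For example: `swap_os_disk my-rg broken-vm fixed-os-disk`. The original OS disk is left in place as an unattached disk, so you can delete it once you are happy with the result.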

Summary

There are numerous ways in which you could mess up a VM and make it unable to boot. Swapping the OS disk is a great way to quickly fix such issues. This shouldn’t deter you from setting up good backup and restore capabilities (e.g. Azure Backup), but for a number of issues an OS disk swap can be a great quick fix.

Although I described a Linux-focused solution here, you can do the same thing with Windows boxes, if you know which files to fix.
