A couple of weeks ago I sat in a meetup where Dave Stanke from Google mentioned that he sometimes creates self-destructing VMs. I had never thought about self-destructing VMs before, so I decided to try this out on Azure.
What’s required for a self-destructing VM
First things first, we’ll need a way to create our VM. During that creation, we’ll schedule a task inside the OS to delete the VM itself. To allow the VM to do this, we’ll need to give it an identity and a role assignment that permits it to delete itself.
Let’s think this through: a VM in Azure consists of a VirtualMachine object, with a NetworkInterface and a Disk attached. That NetworkInterface lives within a VirtualNetwork, and will probably also have a PublicIP address. If we only schedule the deletion of the VM object itself, those peripheral resources would be left behind. Hence the plan: create a brand new network in a brand new resource group, and instruct the VM to delete the entire resource group.
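To make this concrete, here is a minimal sketch of that dedicated resource group and network; the names are illustrative (the full template on GitHub has the real ones), though the later snippets do reference azurerm_resource_group.main:

# Everything the VM needs lives in one dedicated resource group, so a
# single DELETE on the group cleans up the VM, NIC, disk and network.
resource "azurerm_resource_group" "main" {
  name     = "killme-rg"
  location = "West Europe"
}

resource "azurerm_virtual_network" "main" {
  name                = "killme-vnet"
  address_space       = ["10.0.0.0/16"]
  location            = "${azurerm_resource_group.main.location}"
  resource_group_name = "${azurerm_resource_group.main.name}"
}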
Second, how will we delete the VM? We could install the Azure CLI on the VM, but that takes a while to set up. So we’ll make a straight REST API call instead.
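For reference, deleting a resource group through the ARM REST API boils down to a single call of this shape (with placeholders for the subscription, group name and bearer token):

# Deleting a resource group (and everything in it) is one REST call:
curl -X DELETE \
  -H "Authorization: Bearer <token>" \
  "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<group-name>?api-version=2019-05-10"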
Third, how will we schedule the deletion? There is a Linux utility called at that can schedule tasks, and it even has a very ‘natural language’ scheduling syntax. We could schedule our script to execute like this: at now + 15 minutes -f kill.sh
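For example, scheduling a script and then inspecting or cancelling the job looks like this:

# Schedule kill.sh to run 15 minutes from now
at now + 15 minutes -f kill.sh
# List the pending jobs, and remove job 1 if you change your mind
atq
atrm 1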
Finally, we need to consider the language we’ll use to create our resources. I’ve been warming up to Terraform lately, so I’ll build this out in Terraform as well.
Writing this out
The code for this Terraform template can be found on my GitHub. There are two sections worth highlighting here:
In the VM section you’ll see the following:
os_profile {
  computer_name  = "killme"
  admin_username = "nilfranadmin"
  custom_data    = <<-EOF
    #!/bin/bash
    sudo apt-get install at -y
    echo "response=\$(curl 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fmanagement.azure.com%2F' -H Metadata:true -s)" > killme.sh
    echo "access_token=\$(echo \$response | python -c 'import sys, json; print (json.load(sys.stdin)[\"access_token\"])')" >> killme.sh
    echo "curl -X DELETE -H \"Authorization: Bearer \$access_token\" -H \"Content-Type: application/json\" https://management.azure.com/${azurerm_resource_group.main.id}?api-version=2019-05-10" >> killme.sh
    at now + ${var.timeout} -f killme.sh
  EOF
}
The custom_data section is a script that executes on the virtual machine at creation time. It took me some iterative development to get all the quotes and backslashes just right, but this version does the job. Essentially, the script installs at, writes out a killme.sh script that fetches an OAuth token and calls the REST API to delete the resource group, and then schedules that script with at.
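For clarity, once all of the escaping is resolved, the killme.sh that ends up on the VM looks roughly like this (Terraform substitutes the real resource group ID; I’ve left a placeholder):

# Ask the Instance Metadata Service for a token for the management API
response=$(curl 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fmanagement.azure.com%2F' -H Metadata:true -s)
# Extract the access_token field from the JSON response
access_token=$(echo $response | python -c 'import sys, json; print (json.load(sys.stdin)["access_token"])')
# Delete the whole resource group, this VM included
curl -X DELETE -H "Authorization: Bearer $access_token" -H "Content-Type: application/json" "https://management.azure.com/<resource-group-id>?api-version=2019-05-10"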
Another interesting section is here:
resource "azurerm_role_assignment" "test" {
scope = "${azurerm_resource_group.main.id}"
role_definition_name = "Contributor"
principal_id = "${lookup(azurerm_virtual_machine.main.identity[0], "principal_id")}"
}
This section creates a role assignment that gives our virtual machine’s system-assigned identity Contributor access to our resource group. This is what enables the VM to delete the resource group, and itself along with it.
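That principal_id lookup only works because the VM is created with a system-assigned managed identity, which in the azurerm provider comes down to a small block inside the virtual machine resource, roughly:

# Inside the azurerm_virtual_machine resource: request a system-assigned
# identity, whose principal_id the role assignment above binds to.
identity {
  type = "SystemAssigned"
}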
Conclusion and considerations
This is a rather simple template that creates a VM that will delete itself. The VM lives in its own network, and all of the resources are deleted once the timeout expires.
Thinking through this template after creating it, I believe there are more elegant ways to solve this than having the VM delete itself. I’ve started thinking through a mechanism using tags and an automation job (maybe a Logic App) to do this in a more suitable way. More on that later; I hope you at least enjoyed this approach!
“I believe there are more elegant ways to solve this than having the VM delete itself.” I agree; it feels a bit like bloat to install the mechanism onto the VM itself. It’s also a bit troublingly opaque; I can imagine someone new to the team being super confused about why their VMs keep blinking out of existence.
On the other hand, if you’re worried about paying $$$ for zombie instances (which I am), the benefit is that it’s self-contained. Even if somehow all the external automation got messed up, the VM will still terminate on schedule.
We’re both thinking exactly the same thing. If I find a couple of hours in the coming weeks, I’m going to work through a way to do this with external automation. I was thinking of tagging the VM and the surrounding resources (disk / network interface) with a “deleteBy” date, and then running a periodic function/automation job that would scan those tags.
Like you said, that creates extra work, but it could also add some traceability. It’s just fun playing around with these things 🙂
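For what it’s worth, a rough sketch of that periodic scan with the Azure CLI could look something like this (the deleteBy tag name and the use of ISO 8601 timestamps are my assumptions, not anything from the template):

# Sketch: delete every resource whose (hypothetical) deleteBy tag holds
# an ISO 8601 UTC timestamp that has already passed.
now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
az resource list --tag deleteBy --query "[].[id, tags.deleteBy]" -o tsv |
while read -r id delete_by; do
  # ISO 8601 timestamps compare correctly as plain strings
  if [[ "$delete_by" < "$now" ]]; then
    az resource delete --ids "$id"
  fi
done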