The Missing Manual Part 2: When Snapshots Go Wrong

One thing you are probably going to run into if you have virtualized your servers and are using a snapshot-based backup product is orphaned snapshots. I don’t think that is an official term, but it’s the one I’m going to use.

Basically, a snapshot is orphaned when a backup job is aborted, or fails, for some reason that can’t be recovered from by VMware or Veeam (or whatever snapshot-based backup program you’re using). If you have read my other posts on why snapshots can be bad, you might already have an idea of where this post is going.

Problems you might run into include:

  • Snapshot delta files that are not seen from the GUI
  • “Consolidation Helper-0” snapshots
  • “Too many redo logs” if you get more than 32 levels of snapshots
  • VMs that consume entire datastores
  • VMs that are very slow because of too many snapshots

That is probably not a complete list, but those are the ones that I’ve seen. The one that scares me the most is “Too many redo logs” because unless you’re monitoring your datastores for delta files daily, you might not even know it’s going to happen. Then one day you come in to find that your VM is powered off and you can’t power it back on… this instills panic very quickly 😉
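If you want a quick way to do that daily delta-file check, something like this works from an ESXi shell (assuming you have local or SSH shell access enabled; /vmfs/volumes is the standard mount point, but adjust if yours differs):

    # List every snapshot delta file on all mounted datastores, with sizes.
    # Any delta you did not knowingly create, or one that is days old and
    # still growing, deserves an immediate look.
    find /vmfs/volumes -name "*-delta.vmdk" -exec ls -lh {} \;

Run it on a schedule or just eyeball it each morning; either way, an unexpected delta file is your earliest warning sign.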

From Gostev on the Veeam Forum:
Now, when the last snapshot is being removed from the VM, ESX host creates “consolidate helper” snapshot to host the data writes while actual snapshot is being removed. After that was done, the actual “consolidate helper” snapshot is being injected into the main VMDK by ESX. Because in order to commit the last helper snapshot VM I/O must be completely frozen (for obvious reasons), the commit can only take place if both of these conditions are true:
– Helper snapshot size is less than 16MB (which is minimal snapshot size in VMware)
– There is very little write I/O going on the VM at the given moment
 
If any of these are not true, ESX will wait, iteratively creating new helper snapshots to host writes while committing old ones (remember, it needs to have smallest possible snapshot before final commit) while waiting for a “good moment” to freeze VM and commit the last snapshot.

You might be thinking… well, I would certainly catch the problem within 32 days (before it creates 32 snapshots), but it actually takes much less time than that. Depending on how you have Veeam (or your other software) configured, it will retry the job several times per day. By default Veeam tries three times… so at 32 snapshots divided by 3 attempts per day, it could take as few as 11 days to go from no problems to major downtime.

The best way to prevent anything bad like this from happening is to make sure that you are taking snapshots at times when I/O is relatively low. Also, make sure to watch your backup logs… anything involving a failed snapshot should be investigated immediately. If, however, you do not catch it in time, there are some knowledge base articles and blog posts out there, but most will require you to be at the console or have remote command-line access, and most of the time they will have you use vmkfstools to clone the disk to a new VMDK.
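For reference, a vmkfstools clone looks roughly like this (the datastore name, VM name, and snapshot number here are hypothetical placeholders for whatever your environment actually has):

    # Clone the active snapshot chain into one new, consolidated VMDK.
    # Point -i at the current descriptor (the highest -00000X.vmdk),
    # not the flat base disk, so the delta data is rolled into the clone.
    vmkfstools -i /vmfs/volumes/Datastore/VMname/VMname-000004.vmdk \
               /vmfs/volumes/Datastore/VMname/VMname-clone.vmdk

Once the clone is verified, you can point the VM at the new disk and clean up the old chain.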

Sometimes you can also try creating a new snapshot and then selecting the “Delete All” option, which will clean up even the unlisted snapshots.
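You can do the same trick from the command line with vim-cmd on ESXi (it’s vmware-vim-cmd on a classic ESX service console). The VM ID of 42 below is just a placeholder; look yours up with getallvms first:

    # Find the VM's ID (first column of the output)
    vim-cmd vmsvc/getallvms

    # Take a fresh snapshot: vmid, name, description, includeMemory, quiesce
    vim-cmd vmsvc/snapshot.create 42 cleanup "temporary snapshot" 0 0

    # Remove ALL snapshots, forcing a full consolidation of the chain;
    # this is the command-line equivalent of the GUI's "Delete All"
    vim-cmd vmsvc/snapshot.removeall 42

This forces the host to walk the whole snapshot chain and commit it, including deltas the GUI never showed.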

Another snapshot issue I just ran into today was this error:

“Unable to create snapshot: Operation failed because file already exists or Cannot complete the operation because the file or folder [DatastoreName] VMname/VMname.vmx already exists”

Basically, this was preventing Veeam from doing its backups, as well as preventing me from taking a new snapshot. So before powering the VM off, I created a backup with an agent-based backup program, which streamed the data out to a NAS. Then I found VMware KB article 1008058. It explains that sometimes when VMware takes a snapshot it will create some of the files… but not all of them… before it fails, leaving the VM in an inconsistent state and not allowing any more snapshots to take place.

What I found was that my 00004 snapshot had a -delta file… but no matching descriptor vmdk file (the datastore browser showed a -00004-delta.vmdk file with no plain vmdk alongside it). So I deleted the -delta file and then created a snapshot while the VM was still powered off. Then I clicked “Delete All” and it went back and deleted all orphaned snapshots for the VM, effectively committing all of the data to the base vmdk file.
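If you hit the same situation, it is worth confirming the orphan from the shell before deleting anything (again, the datastore and VM names are placeholders for mine):

    # From the VM's directory: each healthy snapshot should show a small
    # descriptor (VMname-00000X.vmdk) paired with its -delta file
    ls -lh /vmfs/volumes/Datastore/VMname/VMname-0000*.vmdk

    # Only after confirming the delta has no descriptor, the VM is powered
    # off, and you have a known-good backup:
    rm /vmfs/volumes/Datastore/VMname/VMname-00004-delta.vmdk

Deleting a delta file throws away whatever writes it held, so treat this as a last resort on a powered-off VM with a verified backup, which is exactly why I ran the agent-based backup first.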

The general rule that I try to follow is this: if you need to keep a restore point more than a few days, take that restore point with Veeam or whatever backup software you are using, and do not leave a snapshot sitting around. But if you are only testing an update or something simple, use a snapshot to protect yourself while you update, and as soon as you are convinced that the update was a success, delete the snapshot.

Overall, I think that when snapshots are given a little TLC they can be a very good thing, but left to run on their own with little or no management, you are inviting disaster.
