Should I Be Using SAN Snapshots as a Backup Solution?
last week on twitter, I spotted someone who posted a link to sqlmag.com article Should I Be Using SAN Snapshots as a Backup Solution? http://sqlmag.com/blog/should-i-be-using-san-snapshots-backup-solution
Personally, I have a very aggressive attitude toward false and ignorant information published by known authors on the internet. People listen and look to us as their source of information, this is why people like us (Known Authors, MVPs, Bexperts and vExperts) should take extreme care when publishing our opinions on the internet, because people will form various types of concepts and decision based on our feedback.
The above article is one of the rare articles backed up with extensive false information and massive SAN ignorance which deliver wrong concepts that leads to wrong decisions and wrong understanding.
I teamed up with Mark Arnold from Netapp; he is a fellow MVP and SAN sage and we responded back to the above article providing evidence and correct information to the readers, I provided generic information and Mark provided Netapp’s related information. Although I worked for a Netapp partner for 3 years, Mark let no chances for me to respond or add anything as his comments were very detailed. I am fortunate to work with him in this article.
Below is our detailed response, we hope that we will be able to correct some concepts.
My response is pretty basic. As the DBA I want to control the database backups and I don’t want to be using SAN Snapshots as a backup, because no matter what any vendor says a Snapshot is not a backup. Here are some of my reasons why I wouldn’t want to use SAN snapshots instead of database backups.
With just about all solutions you as the DBA get full and complete control of your backups. You get to schedule them and keep them as maintenance jobs or Windows task manager jobs. You get to determine what you want to back up and how often. You get full control over Transaction Log backups for databases in Full recovery mode and also control over Simple recovery mode databases. You are correct, however, in stating that a snapshot is not a backup. You do something else with that snapshot. You either replicate it to a mirrored location, a long term retention location, a VTL or other mechanism. The point is that you take the snapshot as an efficient mechanism to take the final backup.
You speak about old SAN technologies, most of modern SAN are capable of integrating SAN snapshot into SQL by enabling full SQL backup via the SAN snapshot, some backup application can integrate with the SAN snapshot as well.
If there’s a problem and the database needs to be restored it’s the DBA that’s going to be thrown under the bus not the storage admin. If I’m going to be the one getting the blame, I’m going to be in control of the situation.
All solutions will allow you to do a restore in any way you deem it. If you look at, for example, SnapManager for SQL Server the DBA is the person in charge. The storage admin simply has no way to help you. It’s your tool and it’s your job to do the snapshots and restores. You will be responsible for telling the storage admin what you want done with the snapshots and that’s about your only interaction on this subject with the storage admin.
From where did that come from, we are 2013 not 1980, modern SAN provide their own snapshot tools and applications to offload the work from the storage admin and empower application admins.
If one or two database pages get corrupt do I really want to restore the entire database to the last snapshot and loose all the changes since then? What is the page become corrupt a month ago and it wasn’t found until now? Now we have to loose a months worth of data to restore the corrupt page? I want the ability to restore just that page using the native page level restore features which require having actual SQL Server backups.
The process that you have in mind misses fundamentally basic storage capabilities. You are able to keep hundreds of snapshots on line or at worst near line. You are able to execute rapid cloning (NetApp call it FlexClone and the other vendors have their own name for broadly the same thing) so that you take your snapshot, make a zero-space read/write copy (clone) and then present it to the SQL server. You as the DBA then go into the temporarily created database and decide what you need to do. You most certainly do not do a full database restore and then roll-forward the logs.
SAN snapshots are not dump, they are integrated with the app so you can treat it as normal backup, some backup application can interpret the snapshot and extract the required data from the snapshot.
You are limited to the times when the snapshot was taken. If I want to roll the database to a point between two snapshots that isn’t possible. When it comes to point in time restore I need the ability to control to which exact point in time the restore happens. Telling me that it’ll be restored to what ever point in time it was when the snapshot was taken isn’t good enough a lot of the time. I need to be able to restore the a specific millisecond.
This is just flat-out wrong. Backup solutions can conduct transaction log backups and their GUI’s have the ability for you to take a given database backup and roll forward logs to any place you want. SnapManager for SQL will do that for you, as will the competitive but complimentary offerings from EMC etc.
If the LUN which the database on it fails all the backups are lost (they are snapshots not clones). And the excuse of but that won’t happen isn’t a valid excuse. Anything that can fail, will.
When you are taking snapshots you are assuming that the LUN hosting the original data will still be there. If that LUN goes away for some reason (failed disk, human error, etc.) we’ve just lost the snapshots as well which means we have no backups.
Murphy’s law does indeed apply. As has been said, snapshots aren’t backups until such time as you do something with that snapshot. Modern systems (FAS, VNX, Isilon etc. etc) all give advanced capabilities. Their RAID subsystems make the likelihood of a failure incredibly small – though not impossible! The point is that the storage systems do “something” with the snapshots so that, if the worst happens, you can get the data back.
5. The backups are now stored on the same device as the production data. If the device fails you’ve lost access to your backups until the device is restored.
Again, false in a correctly designed environment. Replicated data (RecoverPoint, SnapMirror et al) enable full recovery and return to service – both in a Business Continuity and Disaster Recovery context. Solutions such as MetroCluster even provide customers who have a zero downtime requirement a synchronously mirrored solution so that if the primary storage system fails the secondary will take up the load instantaneously and seamlessly to the client server.
Remember, snapshots aren’t backups until you do something with those snapshots and a properly designed environment does just that.
But what will fail, the path (which is redundant), the SAN which has redundant controller, the Disk which is in a RAID, what device failure we are talking about exactly.
6. The backups can’t be compressed.
Those snapshots are going to get large, fast. With native SQL Server backups I’ve got backup compression (assuming you are running the right version and edition of SQL Server) and I’ve got 3rd party tools which I can use to compress the backups with
Absolutley false. Three things come into play here. Snapshots don’t generally get large because they are, on the storage, incremental changes to the physical disk map. Only the delta’s are part of the snapshot and only the delta’s are replicated to the backup solution. Data can be deduplicated by walking the disk and identifying blocks with the same checksum and then walking the bits to make sure there hasn’t been a checksum clash which is much faster than any software compression/dedup implementation. Identical blocks are then deduplicated. Each vendor has their own way of doing this, some in real time, some on a schedule. Broadly the same concept in all cases. Finally, compression. Yes, modern storage solutions offer compression either instead of or in supplement to deduplication.
7. As the DBA I have no control as to how many backups are kept.
While it’s awesome that the storage array can keep 500 backups, that means that we are responsible for making sure that 500 additional copies of our data aren’t being lost, stolen, copied to another company, copied to another server, mounted to the wrong server, etc. One of the reasons that DBAs only want a small number of backups on site at any one time is so that we don’t have to keep track of so many backups and who’s touching them.
Absolutely false. The Snapshot-taking products give you absolute control over this. They let you say if and when the replication to BC/DR takes place, if and when the replication to longer-term storage takes place and how many snapshots you want to maintain on line / near line. You do however want to talk to the storage/backup teams as to how long they keep the data that they have streamed to tape (for example). That’s a trivial piece of interaction with your colleagues. The take-away here is that you get to determine how many of these backups (quantity and/or days back) you want to keep under your direct control (for cloning, restore purposes etc).
8. Taking a recoverable snapshot requires pausing the IO within the SQL Server every time the snapshot is taken which can lead to inconsistent performance for the end users.
In order to snap the databases to get these database snapshots we have to checkpoint the database and pause all IO while the snapshot is being taken. For any users who are writing to the database while this is happening they will see their sessions hang for up to 10 seconds while this is happening. They then complain to the DBA that the database is slow when in fact it’s the snapshot which is causing the problem.
All vendors, including Microsoft do their snapshots using the same API’s. The API’s are designed to prevent the behaviour you describe. What you have written, in the way that you wrote it, is pure scaremongering. remember this is 2013.
9. If there’s a problem with a backup there’s no way to know without attaching the backup to a SQL Server, usually after there’s been a major problem. With native SQL backups I can easily restore the backups to another server to test them rolling transaction logs forward as I see fit.
When we take backups Response that you don’t have a good backup until that backup has been restored. This means that someone needs to take the backup, restore it to the SQL Server and verify that the database can be restored. With the native backups I can do this very easily and roll the logs forward as much as I’d like, all without any risk of performance impact to the production systems. When taking snapshots as backups we now have to attach every snapshot to test it as each backup is totally independent. This requires attaching the snapshot to another server and attaching the databases, which depending on how much data is in the transaction logs as active and needing to be rolled forward or backward could put a lot of stress on the production disks which are being shared with the snapshot (see number 4 above).
See previous comments about cloning a zero-space copy to see if your backup was any good. You are correct in that a backup is worthless until you have proven its viability. The aforementioned technologies afford you those capabilities and do so without consumption of additional disk space and entirely under your control. All of these activities can be done on the data that has been replicated to the BC/DR site or other named spindles. None of the work has to be done on the production spindles. Again, this point (9) is either scaremongering or a demonstration of a misunderstanding of available technologies.
10. If I want to encrypt the backups, I don’t have that option with SAN snapshots.
Given that the database backups will at some point be leaving the secure data center they need to be encrypted so that if the tapes are lost the database backup is useless to who ever finds the tapes. As the SAN snapshots can’t be encrypted this means that we have to rely on the encryption process within the tape backup vendor who may or may not be doing encryption correctly, and they may or may not put the keys in the same place as the backup. While SQL Server doesn’t have an encryption option (other than TDE) as a native feature there are several third party backup products which can encrypt the database backups as they are taken which are known to be secure.
Completely false. The snapshots are simply one’s and zero’s on disk. The SAN vendor doesn’t care about this. You as the SQL professional are responsible for this. If you did a clone of that snapshot to another server it’s useless without the keys for decryption that you provide. In actual fact you are doubly wrong because storage vendors can do disk level encryption so snapshots can be encrypted because the physical disks are encrypted, which is way much efficient and faster than any software implementation. Don’t, however, go implementing both levels of encryption without discussing this within your infrastructure teams.
In conclusion, I usually recommend that my SQL Server clients do not use database snapshots. Hopefully if you are being pushed into using SAN based backups like this person was, you can use this as some reasons not to.
Of course, everyone is entitled to their opinion but those sharing their opinion ought to seek out some help from those people who have an understanding of some of the capabilities that the storage vendors can offer. Your article does read like a laundry list of scary stories that would lead an inexperienced admin down the route that DAS is the only way forward, or that, at the very least, not to do snapshots. That would be a mistake because, as has been pointed out, much of this article is false, but also because the advanced cloning and replication tools afford the admin a rapid way to present data to test & dev type servers without consuming additional disk, wasting time and spending cycles liaising with colleagues in the storage or backup teams who are busy doing their own work.
You are correct if your assumption and information is correct or this article was published 10 years ago; the article was built over total false information and ignorance of modern SAN storage capabilities, I have issue with this article as it speaks to thousands and millions of people following that respectable site and with such amount of false information, it caused more confusion and deviated from the initial intention.
I was fortunate in my life to realize that the world is not Microsoft only, I am a Microsoft lover and fan; no doubts, but Microsoft is built over other technologies like servers, network and SAN. Thus, I recommend to every Microsoft consultant and architect out there to learn those technologies or even understand the basics to be able to judge and build the proper integration decisions, not to repeat the mistake done in the above article.