Monday, April 25, 2011

Backup Strategies for SQL Server

References:
http://msdn.microsoft.com/en-us/library/ms191164.aspx
http://msdn.microsoft.com/en-us/library/ms190217.aspx

There are two recovery models in SQL Server:

Simple: Used when some data loss after recovery is acceptable. The absence of separate log backups simplifies managing backup and restore. . However, a database can be restored only to the point of the most recent backup.

Full: (Full is really a misnomer here, since both are full db backups.  A better name would be Complex) The full recovery model uses log backups to minimize data loss.  The advantage of using log backups is that they enable point-in-time recovery.  Assuming you can back up the active log after a disaster occurs, you can restore the database up to the point of failure without data loss. The disadvantages of using log backups are that they require storage space and increase restore time and complexity.

For most Dynamics GP installations, the Simple recovery model will be sufficient. 

The first step is for the client to determine how much data loss is acceptable
For example, in the event of a drive failure, how many days of work would it be OK to have to re-key?  Keep in mind that the shorter the interval, the more expensive and labor intensive the backup solution will be.

Set the backup interval of the full or differential backup to a frequency that will keep data loss to the required amount
For example, if the client determines that 1 day is acceptable, set the backup frequency to nightly. 

If the frequency must be less than 24hours, the full recovery model is a better fit. 
In that case we set the full database backup to nightly and set differential log backups more frequently. 
The differentials will add more complexity to the restore process, but they will be less resource intensive during business hours and will take up less drive space.

Backups should be stored on a different physical drive than the application, database, and log files.

Backup files should be periodically shipped off site
There are a variety of acceptable ways to accomplish this.  One acceptable disk storage solution would be:
Drive C:  Single drive for Applications
Drive D:  RAID 5 disk for data files
Drive E:  RAID 5 disk for log files
Drive F:  Single drive for tempdb
Drive G:  RAID 1 disk for backups

While much of the backup process can be automated, there is a certain amount of labor required. 
Let's say a company uses the above drive configuration and uses a simple recovery plan with weekly full backups and nightly differentials.  The weekly full backups are transferred to one or more portable drives (or the RAID 1 slave drive is swapped out) that are then secured off site.  Thirteen months of history will be kept this way. 

In this scenario, the files will continue to collect on Drive G.  Someone will have to retrieve the portable drive, attach it to the network, transfer the weekly backup files, delete files older than 13 months from Drive G and the portable drives, and take the portable drives back off site.  Changing the procedure to copy files to a remote location only removes the labor associated with transporting the drives and it substantially increases the cost.

The most important labor process, however, is a yearly disaster recovery simulation
If this is skipped, the entire backup plan is put in jeopardy. 
There must be a written plan for disaster recovery and someone who knows how to carry it out. 
A good disaster recovery plan will have every conceivable scenario covered.  Some examples of scenarios to plan for:
1. A sudden event (such as a fire or theft) destroys the hardware.
2. A slow event (such as a hurricane or flood) threatens the hardware.
3. An event (such as discovery of asbestos) makes the current location uninhabitable for an extended period, but the hardware is intact. (In this case, there is no reason any data needs to be lost)
4. External malicious intrusion destroys or corrupts data
5. Internal malicious event destroys or corrupts the data.
6. Internal non-malicious event destroys or corrupts data.

Each of these events should have a plan spelled out, and in most cases note who is responsible for carrying it out.  It should also spell out what is to be done after the disaster to recover.  The responsible parties should be given a chance to practice the plan at least once a year. 

To assist in writing each part of the plan, I have included some questions that the plan should address.
1. A sudden event (such as a fire or theft) destroys the hardware.
Where is the hardware inventory kept?  Who maintains it?
Who will be responsible for determining what is missing/damaged?
Who will notify management?
If the SQL Server is damaged, who will restore it?  To what point?
Who will notify users?

2. A slow event (such as a hurricane or flood) threatens the hardware.
How long before event impact will emergency procedures begin?  Who makes this call?  Who notifies users?
Who makes the call to evacuate?
What will be done to protect user hardware?  Who is responsible?
What will be done to protect network infrastructure?  Who is responsible?
How long before event impact will final backups be made?  What will be backed up?  Who is responsible?
What hardware will be evacuated?  How long before event impact?  Who will prep it?  Who will transport it?
Who will assess damages after the event?
Who will make the call to restore operations?
What procedures will be used to restore the network infrastructure?  Who is responsible?
What procedures will be used to restore user hardware?  Who is responsible?

3. An event (such as discovery of asbestos) makes the current location uninhabitable for an extended period, but the hardware is intact.
Will the hardware be moved? (If so, ask questions from #2)  Who makes this call?
Will the network be replicated elsewhere or will the company access the current network remotely?
Same questions as #2

4. External malicious intrusion destroys or corrupts data
5. Internal malicious event destroys or corrupts the data.
6. Internal non-malicious event destroys or corrupts data.
Who will notify management?
What will be done in the short term to prevent further intrusion/corruption?  Who is responsible?
Who will notify users?
Who will be responsible for hardening the system after the intrusion?
If the SQL data is damaged, who will determine what point to restore to?  Who will restore the data to this point?
What procedures will be followed to restore to that point in time?

For example, in Florida there are hurricanes yearly.  This is a slow event.  The recovery plan should state what hardware/data should be taken off site during evacuation, and who is responsible for doing this.  It should also state what procedures should be followed before abandoning the hardware site.  Once a year, the responsible parties should be permitted (during down time) to actually practice following the procedures and removing the hardware/data (going so far as to loading it for transport), then bringing it back in and reintegrating/restoring (in a test environment if necessary).

Here is a partial example of a recovery plan (notice that the SQL backups are on a RAID 1 disk, so would be taken off site).:
Disaster Recovery Plan
1. Fire, theft, or other sudden event

a. The IT director will keep an updated inventory of all infrastructure hardware and will use this list to determine hardware loss.
b. ...procedures for notification of management team and users
c. In the event that the SQL Server machine or storage drives have been damaged:
- ...Procedures for procuring new hardware or using existing
- If necessary, SQL Server will be installed
...same drive configuration as above
- The portable drive will be retrieved from off-site and the last weekly backup will be restored.
- If the G Drive is intact, the last nightly or differential backup will be restored.
d. ...procedures for other parts of the network

2. Hurricane or other event where warning is given.
  a. The IT director will monitor the likelihood of the disaster affecting infrastructure when the disaster moves within 3 days and will send daily updates to the management team.  The management team may make a decision to evacuate before the government orders a mandatory evacuation.
  b. Upon issuance of a mandatory evacuation order, or upon direction from the management team, the IT director will issue a 2 hour warning to all employees via email before beginning safeguarding procedures.
  c. All employees will be responsible for backing up their own local data to a portable device to take with them. Employees should also take their laptops and portable projectors with them as they evacuate.
  d. At the end of the 2 hour warning, IT staff will execute all weekly backup jobs 
  f. IT staff will be responsible for ensuring that all user hardware is shut down and unplugged.
  g. Once the jobs are complete, all network infrastructure will be shut down and physically disconnected from power sources.
  h. The IT director will keep an updated inventory of all infrastructure hardware.  The IT director will evacuate with this list and all RAID 1 slave drives and the portable off-site backup drives.
  i.  At the earlier possible time after the disaster has passed, the IT director will return to the premises with the management team to assess the damages.
  j.  Once the management team gives the direction to bring the systems back online, the IT team will replace the RAID 1 slave drives and power up the network infrastructure.  Once all systems are online and verified, the IT team will reconnect the employee hardware to power.


For the purposes of SQL Server recovery, it is important for the IT staff to practice step c above
Practice will not only bring up questions about how to install/recover, but it will also bring attention to any areas of the backup plan that are failing (such as a bad backup job, failing drive, or missed database).  It is not necessary to spend a ton of money replicating your entire server rack.  Simply adding a few cheap drives to the existing network can be enough to provide an area that staff can practice restoring hardware.  Remember, hardware and off the shelf software are easily replaced.  It is your data that is valuable.  Consider how many man-hours are spent generating that data each year and you will quickly see that it is much cheaper to spent a little money to plan ahead than to regenerate that data.


Did this help you?  If so, please leave a comment!

No comments:

Post a Comment

SQL 2022 TSQL snapshot backups!

  SQL 2022 now actually supports snapshot backups!  More specifically, T-SQL snapshot backups. Of course this is hardware-dependent. Here ...