Disaster Planning and Strategy for Windows Domains
Bad things happen to computers, sometimes unintentionally. At some point they will happen to you, and this page will (hopefully) help you when they do.
Backup Strategy and Implementation
Backups should be the second most important thing in your job; the first is having a system which runs. Before you allow users to put data onto your servers you need to think about how to back it up. If you don't have a backup system in place you are in trouble.
With unlimited funds backups really don't pose much of a problem. However, most Sys Admins are working within very tight budgetary constraints.
Things to consider are:
- Who are you going to provide backups for?
- How much data do they have that needs to be backed up?
- How much of that data actually needs to be backed up?
- Are you sure? Ask this question (at least) twice
- How can you back it up? External RAID or tape (central backup or client)? External hard drive, USB or DVD for personal backup?
- Do you have the resources to buy the backup device (and media!) to back it up?
- Are you going to provide a long term archive for data?
- Do you have so much data that you can't back it up in a single run overnight? If so, you may need a more advanced system than just one drive or set of drives, or maybe you can split the data up onto different servers.
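The last question in the list above comes down to simple arithmetic, which can be sketched as follows. This is a rough estimate only: the function names, the 2 TB data size, the 40 MB/s sustained throughput and the 8 hour window are all made-up illustration values, and 1 GB is taken as 1024 MB. Measure your own device's real sustained rate before trusting any figure.

```python
def backup_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to copy data_gb at a sustained throughput (1 GB = 1024 MB)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


def fits_in_window(data_gb: float, throughput_mb_per_s: float,
                   window_hours: float) -> bool:
    """True if a full backup completes inside the available window."""
    return backup_hours(data_gb, throughput_mb_per_s) <= window_hours


if __name__ == "__main__":
    # Hypothetical figures: 2 TB of data, 40 MB/s sustained, 8 hour night window.
    hours = backup_hours(2048, 40)
    print(f"Estimated full backup time: {hours:.1f} hours")
    if fits_in_window(2048, 40, 8):
        print("A full backup fits in the overnight window.")
    else:
        print("Too slow for one night: split the data or use a faster system.")
```

If the estimate does not fit, that is the point at which splitting data across servers or buying a faster device becomes a budget conversation rather than a guess.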
Users notoriously hold onto data and will gladly fill up any space you allocate to them. To help limit this you should set a policy on how much you are prepared to back up and what you will and won't back up for users. Point out that you are not a repository for their MP3 collection or photos (or maybe, in a bid to be popular, you are!). Provide a central store they (and your backup devices) can easily access and then limit the amount of space they have with quotas. If more space is required make them come and ask and justify it. At the least get the cash to be able to back up and restore in a sensible manner.
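Quotas themselves are enforced by the file server, but a quick report of who is over the agreed limit can help when a user comes to ask for more space. The sketch below assumes a hypothetical layout of one folder per user directly under a central share; the function names and the layout are illustrative assumptions, not part of any Windows tool.

```python
import os


def directory_size_bytes(path: str) -> int:
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # File vanished or is unreadable; skip it rather than abort.
                continue
    return total


def over_quota(share_root: str, quota_bytes: int) -> dict:
    """Map each user folder name to its usage when it exceeds quota_bytes."""
    report = {}
    with os.scandir(share_root) as entries:
        for entry in entries:
            if entry.is_dir():
                used = directory_size_bytes(entry.path)
                if used > quota_bytes:
                    report[entry.name] = used
    return report
```

For example, over_quota(share, 5 * 1024**3) would list every folder under the (hypothetical) path held in share that uses more than 5 GB.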
Backups are useless without the ability to restore the data. You should ensure that you regularly attempt a restore from the media.
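One way to make a test restore more than a glance at the file list is to compare checksums between a known sample and its restored copy. The following is a minimal sketch, assuming you restore to a scratch directory and compare against a sample that has not changed since the backup ran; the function names are illustrative.

```python
import hashlib
import os


def file_sha256(path: str, chunk_size: int = 1024 * 1024) -> str:
    """SHA-256 of a file, read in chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original_root: str, restored_root: str) -> list:
    """Return (relative_path, problem) pairs for files missing or different
    in the restored copy. An empty list means the sample restored cleanly."""
    problems = []
    for root, _dirs, files in os.walk(original_root):
        for name in files:
            source = os.path.join(root, name)
            relative = os.path.relpath(source, original_root)
            restored = os.path.join(restored_root, relative)
            if not os.path.isfile(restored):
                problems.append((relative, "missing"))
            elif file_sha256(source) != file_sha256(restored):
                problems.append((relative, "different"))
    return sorted(problems)
```

Files that users have edited since the backup will naturally show up as different, so a small static test folder kept purely for restore checks gives the cleanest result.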
Tapes wear out and become less reliable over time. You should change the tapes fairly regularly: a tape used daily is unlikely to be that reliable after a couple of months.
Make sure you can restore data to a different machine. If the server which held your (only) tape drive dies, what do you do? Having an in-built tape drive may be cheaper, but it's not very convenient if the hardware has a problem (which doesn't involve the tape drive of course). With an external drive you can at least quickly plug it into a different system - assuming it has a SCSI card for a SCSI drive and so on.
Disaster Scenarios
Having to restore data is only part of your problem. Other nasty things can happen too. How prepared are you?
Think: what if? What happens if your main server fails? What will you do? The headless chicken dance is not a good response!
Active Directory Specific
Although Active Directory Domain controllers are all supposed to be equal, some in fact are more equal than others. You may well have certain data attached to certain systems, which is why it's better, if possible, to use network based storage. However there are some AD specific functions which will impact on you if a DC fails.
You must have DNS running for Active Directory to work and you should have two servers for fault tolerance. If your Primary DNS fails (the one with the Primary zone for your Domain name as a whole or the Underscore and other AD required zones) what will you do? If the SRV records can't be found, users with no local account will have to rely on cached credentials to log on to their machine, which may not be possible depending on your settings.
As long as you have two DNS servers within your Domain, both of which are listed in your clients' IP configuration, you should be fine. If the server with the Primary zones fails you can promote the required zones from Secondary to Primary zones on the second server.
If you are using Active Directory integrated DNS then you won't have this problem, as all of the zones are effectively Primary zones. In this configuration you should also have DNS installed on all of your DCs.
If the worst happens and you lose all your DNS information, Don't Panic! You just need to establish a DNS server, recreate the Primary zones, point your DCs to the DNS server and re-start the Netlogon service. Re-starting Netlogon will cause the DCs to write the SRV records into the DNS when the service starts.
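After re-starting Netlogon it is worth checking that the SRV records really have come back. The sketch below builds the names of some commonly registered AD locator records and the matching nslookup commands to run by hand. The domain example.test and the address 192.0.2.10 are placeholders, the list is not exhaustive, and the global catalog record lives under the forest root domain, which is only the same as your Domain in a single Domain forest.

```python
def ad_srv_records(domain: str) -> list:
    """Names of some SRV records that DCs register for an AD domain."""
    domain = domain.strip(".").lower()
    return [
        f"_ldap._tcp.{domain}",
        f"_kerberos._tcp.{domain}",
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
        f"_ldap._tcp.pdc._msdcs.{domain}",
        # Registered under the forest root domain; assumed here to be
        # the same as `domain` (single Domain forest).
        f"_ldap._tcp.gc._msdcs.{domain}",
    ]


def nslookup_commands(domain: str, dns_server: str) -> list:
    """nslookup command lines that query each record on a given DNS server."""
    return [
        f"nslookup -type=SRV {name} {dns_server}"
        for name in ad_srv_records(domain)
    ]


if __name__ == "__main__":
    # Placeholder domain and DNS server address.
    for command in nslookup_commands("example.test", "192.0.2.10"):
        print(command)
```

If any of these lookups still fail after Netlogon has been re-started, check that the DC is actually pointing at the rebuilt DNS server and that the zone accepts dynamic updates.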
Operation Master (FSMO) Roles
If you don't know what the FSMO roles are or how to transfer them you should attend the Active Directory course run by Windows Support or at the least do some rapid swotting up on Active Directory. The FSMO roles are five specific roles assigned to one or more Domain Controllers within a Domain or Forest. If a DC with one or more of these roles fails you will get errors, depending on which role has been taken off the network.
The simplest solution is to be aware of which server holds which FSMO role and if that server fails know how to transfer or recover the role.
The first DC in your Domain and Forest will have all the Roles assigned. In a Single Domain environment further DCs will not get these Roles assigned unless you move them.
Only the loss of the PDC emulator will immediately impact on users; the loss of the other roles will impact on Sys Admins depending on the tasks being done. See below for guidelines.
Guidelines for dealing with FSMO Roles
- Only the PDC Emulator role will have an impact on users if the role holder fails. You should seize this role very quickly if the role holder fails.
- Schema master loss prevents modification of the schema, either manually or by an Application.
- Loss of the Domain Naming master will only be noticed if you try to add a Domain to the forest.
- Loss of the RID master will only be noticed when a DC uses up its allocated pool of RID numbers (used when creating users and other security objects) and cannot request more.
- Loss of the Infrastructure master will be noticed if you are moving or renaming large numbers of accounts.
- If you get warning of a server failure or believe a server holding the roles will fail, transfer them to another server before it fails.
- If you seize a role, do not put the original role holder back onto the network with Active Directory still installed on the server - ever. This includes restoring a System state as well! Wipe the system and follow the instructions in MS knowledge base article 216498 http://support.microsoft.com/kb/216498/EN-US/ to clear the debris from your Active Directory.
You can transfer and seize the roles using ntdsutil.
NOTE: In order to see the schema management MMC snap-in you need to register schmmgmt.dll using Regsvr32. Go to Start, Run and enter Regsvr32 schmmgmt.dll.
The global catalog server contains a subset of AD information which must be available at all times. You can use Replmon from the support tools to find out which server is the global catalog server or look in Active Directory Sites and Services. If the Global catalog server fails you can create a new GC by using AD Sites and Services. Select a server, and open the view so you can see the NTDS Settings object. Get the properties for the NTDS Settings object and put a tick in the box for Global Catalog server.
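As a memory aid, the interactive ntdsutil session for moving a role can be written down in advance. The sketch below only builds the sequence of lines you would type; it does not run anything. The server name dc2 and the function name are hypothetical, and the role keywords are those used by ntdsutil on Server 2008 and later (on Server 2000 and 2003 the naming role is entered as "domain naming master"). Treat it as a checklist to verify against your own version rather than a definitive procedure.

```python
# Keywords typed after "transfer" or "seize" at the fsmo maintenance prompt.
ROLE_KEYWORDS = {
    "schema": "schema master",
    "naming": "naming master",
    "rid": "RID master",
    "pdc": "PDC",
    "infrastructure": "infrastructure master",
}


def ntdsutil_steps(role: str, target_dc: str, seize: bool = False) -> list:
    """Lines to type, in order, to move one FSMO role to target_dc.

    Use seize=True only when the old role holder is dead and will never
    return to the network with Active Directory installed.
    """
    if role not in ROLE_KEYWORDS:
        raise ValueError(f"unknown role: {role}")
    verb = "seize" if seize else "transfer"
    return [
        "ntdsutil",
        "roles",
        "connections",
        f"connect to server {target_dc}",
        "quit",
        f"{verb} {ROLE_KEYWORDS[role]}",
        "quit",
        "quit",
    ]


if __name__ == "__main__":
    # Hypothetical example: seize the PDC emulator role onto a DC called dc2.
    for line in ntdsutil_steps("pdc", "dc2", seize=True):
        print(line)
```

Before a failure ever happens, running netdom query fsmo on a DC and noting the output is a quick way to record which server currently holds which role.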
NetBIOS over TCP/IP (NetBT) networking is still used alongside Active Directory and DNS, and is required for Microsoft network browsing without WINS. If your Master browser fails you will have problems using Microsoft networking until a new master is established. You should have at least one backup browse master in your Domain. Use the Browstat command to resolve NetBIOS browse issues. With Server 2008 the Computer Browser service is disabled by default, so you may need to re-enable it.
You can avoid these problems by using WINS and making sure your PDC emulator is running (this is the server which will be your Master browser).
Guidelines for NetBrowse Issues
See the following MS knowledge base article for information: http://support.microsoft.com/kb/818092
Be aware of the Registry key which holds the browser settings: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Browser\Parameters
- Make sure you know which of your servers is the browse master.
- Make sure you have at least one backup browse master - it should be a server for preference.
- To re-register NetBIOS names, run the nbtstat -RR command.
Some general things to think about:
- Pick a component in your server, any component, then think about what happens if it fails. How quickly can you replace the PSU? The CPU? The RAM? Do you have ANY spare parts?
- Do you actually need spare parts? Can you source basic replacement kit the same day from a local shop?
- If not can you get them delivered (and authorised without forging a signature) within 24 hours? If the person who signs the cheques is away then what?
- If a server fails can you transfer its roles to another server? If you can, is it worth the time and effort involved, or is it better to wait for a spare/replacement bit of kit? If it's worth it, how can you go about it?
- What happens if the air conditioning breaks? Or leaks?!
- Is the air conditioning installed above your servers? Is there water there?! How do you turn off any water supply? A large plastic sheet can be VERY handy if your air conditioning unit is above or even near your servers and you get an unwanted ice build up.
- Power failure? What, no UPS?
- Do you have a fire extinguisher in your server room or near your servers? One you know how to use? One you know how to use that isn't water? Can it be used on electrical equipment?
- Building on fire? (Hopefully not because of the previous point.) That's OK, you keep a (recent!) set of backup tapes to do a FULL restore of your data in the fire proof safe/off site, right?
RAID Specific Considerations
- Spare card? If your RAID card fails and you are using RAID 5 you can't access the data without another RAID card. Keep a spare or be sure you can afford to wait for a new one.
- Where are your spare disks for your RAID server? Not got any? Oops.
- Are your RAID drives hot swappable? How?
- If they are not hot swappable, are you comfortable identifying a failed drive? Re-building or replacing the wrong drive in a RAID set is a BAD thing to do!
- Do you actually know how to re-build a failed disk?
- Did you bother to check how long a re-build of a full disk is supposed to take? Knowing this lets you warn the users that it will take at least X amount of time for service to be restored. Tell them this for no reason other than it will keep (hopefully) most of them away from your door until that time has passed.
- Where DID you put the manual? Can you find the manual online?
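The re-build time question above can be roughed out ahead of time so you have a figure ready to give the users. This is only a back-of-envelope sketch: the 500 GB disk, the 50 MB/s re-build rate and the 1.5 safety factor are invented illustration values, and your controller's manual or a timed test re-build is the only reliable source for the real rate.

```python
import math


def rebuild_hours(disk_gb: float, rebuild_mb_per_s: float) -> float:
    """Hours to re-build one full disk at a sustained rate (1 GB = 1024 MB)."""
    if rebuild_mb_per_s <= 0:
        raise ValueError("rebuild rate must be positive")
    return disk_gb * 1024 / rebuild_mb_per_s / 3600


def notice_hours(disk_gb: float, rebuild_mb_per_s: float,
                 safety_factor: float = 1.5) -> int:
    """Whole hours to quote to users: the estimate padded and rounded up."""
    return math.ceil(rebuild_hours(disk_gb, rebuild_mb_per_s) * safety_factor)


if __name__ == "__main__":
    # Hypothetical figures: 500 GB disk, 50 MB/s re-build rate.
    print(f"Raw estimate: {rebuild_hours(500, 50):.1f} hours")
    print(f"Tell the users: at least {notice_hours(500, 50)} hours")
```

Padding the figure is deliberate: a re-build on a server that is still serving files will usually run slower than the quoted rate.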
You can't think of everything, but even if the worst happens, having thought some of the possible events through beforehand, such as what happens if your main file server dies, will save you a lot of stress when it does happen! Because when it does happen everyone will be looking at you.