8 steps to prepare for a major IT outage
Columnist Gene Rogers on making sure you know what to do when IT systems go down
When things go wrong in the enterprise, what do you do, who do you contact, and how do you fix it fast? A lot depends on how well you have prepared.
We never want it to happen, but let's face it, it will. So what do you do when it does happen? Make no mistake here. You have to prepare, you must have a plan, and you must practice executing the plan BEFORE a crisis happens.
There are 8 things you need to do to get yourself ready to manage a major incident.
#1. Determine what constitutes a major incident
Do this ahead of time! You must determine how long each of your major systems can be down before there is a major impact on your company. How long can your invoicing system be down before you start losing money? How long can your website be down before you lose customers? What about your HR system?
Work with each of your business units to find the impact. (This is called a Business Impact Analysis.) Get an agreement with them, in writing, about how you will handle these situations, including who you will contact within their business unit in the event of a major incident and the frequency of the updates.
This is so critical it bears repeating: With your business units, create a definition of a major incident and agree on how long their systems can be down before critical impact occurs. Define how you are going to respond, who your team is going to contact in a crisis, and the frequency of the updates.
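One way to keep the results of the Business Impact Analysis usable during a crisis is to record them as structured data. The sketch below is only an illustration: the system names, time limits, contact names, and update intervals are made-up placeholders, and the real values come from your written agreements with each business unit.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ImpactAgreement:
    """One row of a Business Impact Analysis (all values are hypothetical)."""
    max_downtime: timedelta      # agreed limit before critical impact
    business_contact: str        # who to notify in the business unit
    update_interval: timedelta   # agreed frequency of status updates


# Placeholder figures; real ones come from the written agreements.
AGREEMENTS = {
    "invoicing": ImpactAgreement(timedelta(hours=4), "finance-on-call", timedelta(hours=1)),
    "website": ImpactAgreement(timedelta(minutes=30), "marketing-on-call", timedelta(minutes=30)),
    "hr": ImpactAgreement(timedelta(days=2), "hr-on-call", timedelta(hours=4)),
}


def limit_reached(system: str, downtime: timedelta) -> bool:
    """Return True once an outage has lasted as long as the agreed limit."""
    return downtime >= AGREEMENTS[system].max_downtime


print(limit_reached("website", timedelta(minutes=45)))   # True
print(limit_reached("invoicing", timedelta(hours=1)))    # False
```

In practice you would probably declare a major incident well before the limit is reached; the point of the table is that nobody has to argue about the numbers while the system is down.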
#2. Create a crisis team with at least one primary and one backup contact for each service area
Telecom & Mobility
Hosting & Storage
Applications (one for each enterprise application, such as CRM and ERP)
Help Desk Manager
IT Service Director
You also need to include contacts from your major vendors. Make sure these are not the sales contacts; you need the technical team.
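A roster like this is easy to let go stale, so it can help to check it automatically. The following is a minimal sketch, assuming the roster is kept as a simple mapping from service area to contacts; the `AreaContacts` class, the function name, and the people listed are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AreaContacts:
    """Primary and backup contact for one service area (names are made up)."""
    primary: Optional[str] = None
    backup: Optional[str] = None


def areas_missing_contacts(roster: dict[str, AreaContacts]) -> list[str]:
    """List the service areas that lack a primary or a backup contact."""
    return sorted(
        area for area, contacts in roster.items()
        if not (contacts.primary and contacts.backup)
    )


roster = {
    "Telecom & Mobility": AreaContacts("alice", "bob"),
    "Hosting & Storage": AreaContacts("carol", None),
    "Applications: CRM": AreaContacts("dave", "erin"),
    "Applications: ERP": AreaContacts(None, "frank"),
}

# Areas reported here need attention before the next crisis, not during it.
print(areas_missing_contacts(roster))
```

Vendor technical contacts could be added to the same mapping as extra entries.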
#3. Create a crisis checklist
This takes the guesswork out of managing the situation. Distribute the checklist to all the potential participants before a crisis happens.
#4. Create communication templates for email, text, and voice communications
You also need to think about the frequency of communications and the distribution list. Doing this in advance will not only improve the quality of the communications, it will also give you back time during the crisis: you will not be writing messages when you should be troubleshooting.
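As a rough illustration of what a prepared template can look like, here is a small sketch using Python's standard `string.Template`. The wording and the placeholder names (`system`, `start`, `status`, `impact`, `next_update`) are assumptions for the example, not a prescribed format.

```python
from string import Template

# Hypothetical wording; adapt it to your own organization.
TEXT_ALERT = Template("URGENT: $system is down as of $start. Next update: $next_update.")
EMAIL_UPDATE = Template(
    "Subject: URGENT: $system outage update\n\n"
    "Status: $status\n"
    "Business impact: $impact\n"
    "Next update: $next_update\n"
)


def render(template: Template, **fields: str) -> str:
    """Fill in a template; raises KeyError if a placeholder is left unfilled."""
    return template.substitute(**fields)


print(render(TEXT_ALERT, system="Invoicing", start="09:15", next_update="10:00"))
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a message with a missing field fails loudly instead of going out half finished.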
#5. Create an automated alert from the Help Desk for all the potential crisis management team members
This is the fastest way to let everyone know that an outage has happened and what they need to do. My experience is that text messaging works best, but also send an email with the word "URGENT" (or something similar) in the subject line.
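The email half of such an alert can be sketched with Python's standard `email.message.EmailMessage`. This is only an assumed shape: the function name, sender address, recipients, and message wording are hypothetical, and the code builds the message without sending it, since delivery depends on your own mail and text-messaging systems.

```python
from email.message import EmailMessage


def build_alert_email(system: str, recipients: list[str],
                      sender: str = "helpdesk@example.com") -> EmailMessage:
    """Build (but do not send) an outage alert marked URGENT."""
    msg = EmailMessage()
    msg["Subject"] = f"URGENT: {system} outage"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        f"A major incident has been declared for {system}.\n"
        "Join the conference line listed in the crisis checklist."
    )
    return msg


alert = build_alert_email("Invoicing", ["a@example.com", "b@example.com"])
print(alert["Subject"])
```

The recipient list would normally come from the crisis team roster, so that primary and backup contacts are alerted together.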
#6. Create a pre-defined conference line for tech folks to call into and a different line for business folks
It's important to separate the two. The tech team needs a dedicated line so they can troubleshoot. The business folks need a conference line to discuss which customers are impacted, how to communicate with those customers, and what manual processes may need to be triggered.
#7. Create a script for the conference line to keep order
Follow this script at the beginning of the calls and at the start of each update (usually at the top of every hour).
#8. Practice activating the crisis team
Send out practice alerts to both primary and secondary contacts and open the conference call lines. After a few dry runs, it should take no longer than 10 minutes to activate the crisis team. Even if you do not have a major incident, practice getting the team engaged at least once per quarter. When something goes wrong for real, you will be glad you did!
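If you log your dry runs, both targets mentioned above (activation within 10 minutes, a drill at least once per quarter) are simple to check. The sketch below assumes you record when the alert went out and when the team was assembled; the function names are hypothetical, and treating "once per quarter" as a gap of at most 92 days is an approximation chosen for the example.

```python
from datetime import date, datetime, timedelta

ACTIVATION_TARGET = timedelta(minutes=10)   # target stated in the text
MAX_DRILL_GAP = timedelta(days=92)          # assumption: roughly one quarter


def activation_on_target(alert_sent: datetime, team_assembled: datetime) -> bool:
    """Return True if the crisis team was assembled within the target time."""
    return team_assembled - alert_sent <= ACTIVATION_TARGET


def drill_overdue(last_drill: date, today: date) -> bool:
    """Return True if more than about a quarter has passed since the last drill."""
    return today - last_drill > MAX_DRILL_GAP


sent = datetime(2024, 3, 1, 9, 0)
print(activation_on_target(sent, datetime(2024, 3, 1, 9, 8)))    # True
print(drill_overdue(date(2024, 1, 5), date(2024, 6, 1)))         # True
```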
Following these 8 tips will get you prepared to manage a major outage. Preparation and practice are the most critical success factors, so get cracking on making sure you are ready for when things really go wrong in your enterprise.