Outage - Do you have a planned outage process?

Updated by Rob Thomlinson SSW 11 months ago. See history

123

For unplanned outages, see Outage - Do you have an unplanned outage process?

If your servers are down or have to go down during business hours you should notify the users at least 15 minutes beforehand so you will not get 101 people all asking you if the computer is down.

For short outages (under 15 minutes) that only affect only a few people (under 5 people), or are outside of business hours, then IM is the best method. If you use Microsoft Teams or Skype, a quick message will do.

Note: If they are not online on Teams or Skype, then they can't complain that they were not warned.

For extended or planned outages, or if you have a larger number of users (50+), email is the suggested method.

Email

If you send an email it is a good idea to tell the user a way to monitor the network themselves. Eg. Software solutions like SCOM or WhatsUp Gold.

Include a "To myself". It gives visibility to others who are interested in what needs to be done to fix the problem and makes it easier to remember to send the 'done' email. E.g. "done - CRM is alive again".

::: email-template

To:SSWAll
Subject:Planned OutageDate and time
::: email-content

Hi All

Here is the summary of the outage plan:

Planned/Unplanned:Planned
Change Description:Install Windows Updates and Restart Server
Risk (see table below):LOW RISK (LOW Probability and MEDIUM Impact)
Reason For Change:Windows 2016 Windows Updates
Uptime over last month:91.361%
Planned Outage (mins):150
Planned Start Time:26 October 9:00 PM
Planned Finish Time:26 October 11:30 PM
Affected Services:\Windows Server 2016 , sharepoint.ssw.com.au intranet.ssw.com.au , projects.ssw.com.au

Risk Lookup Table by Probability and Impact:

Image

Figure: Clearly showing the potential risks

Note: The following servers will be affected:

and

To myself

To show others who are interested in what needs to be done to fix the problem:

Detailed Change Plan:

  1. Lockout users via IIS
  2. Backup server
  3. Install Windows Updates
  4. Reboot server
  5. Follow test plan
  6. Based on result of test plan, follow backout plan if procedure failed
  7. Procedure completed

Test Plan:

  1. Check Event log for errors
  2. Check each affected service is running
  3. Call test users to start “Test Please” on the affect services
  4. Get result of user “Test Please” by email by 11:15 PM

Backout Plan:

  1. Restore server from backup

Note: <This is as per rule Outage - Do you have a planned outage process? >

::: :::

Figure: Example planned outage email

Pre-Outage Checklist

Immediately before the scheduled downtime, check for logged in users, file access, and database connections.

Users

Run | Taskmgr | Users' tab | Check active connections | Request users to log off

Image

Figure: Connected users can be viewed in Task Manager

Files

Run | compmgmt.msc | System Tools | Shared Folders' | Review 'Session' and 'Open Files' for user connections

Image

Figure: Computer Management 'Open Files' View

Database

SQL Server Management Studio | SQL Server Connection | Activity Manager

Image

Figure: SQL Management Studio 'Active Connections' View

Once these have been checked for active users, and users have logged off, maintenance can be carried out.

Restarts should only be performed during the following time periods

  1. Between 7am and 7:05am
  2. Between 1pm and 1:05pm
  3. Between 7pm and 7:05pm

If a scheduled shutdown is required, use the PsShutdown utility from Microsoft's Sys Internals page.

Always reply 'Done' when you finish the task.

acknowledgements
related rules