Game Day Exercise – Schuberg Philis
Game Day (aka DR Test) Scenario
The emphasis of this year’s Game Day Exercise (DR test) was on three major items.
- First of all, we needed to know whether the employees who recently joined Schuberg Philis are familiar with and capable of executing a DR test. Over the last years we executed multiple DR tests, we documented large parts of the execution, and the experienced engineers and customer operations managers know by heart how to handle a DR or any other major outage. We have three relatively inexperienced (from an SBP perspective) engineers in the team, and one engineer joined this team over a year ago. Therefore it was good to know whether those engineers are capable of executing the DR without being kick-started by the experienced engineers. Is the balance between automatism and improvisation the right one?
- Secondly, we placed more emphasis on the organizational aspect of the DR test than in previous years. The test was planned but not communicated, so the engineers did not know that they would need to execute a full DR; hence we call it a Game Day exercise. The reason for this is that we needed to know how people react when a major event occurs. The test was kicked off at 18:00 on a Monday evening. By doing so, we knew there was a big chance that engineers were commuting from the office to home. This scenario let us test exactly how long it takes to set up a fully operational working environment, including communication.
- Thirdly, we tested the link with the SBP Business Continuity Plan for the first time since the BCP had been adjusted in September. Over the last years we learned that communication is one of the key success factors in a DR scenario. We also learned that communication with too many stakeholders interferes with the execution of the test.
Relation to the tests of previous years
In 2009 Schuberg Philis implemented a redundant environment across two datacenters. From that moment it was possible to execute DR tests in such a manner that the IT functionality could be made available in a datacenter other than Schuberg Philis’s own. Part of this project was a full failover test prior to go-live. After go-live, no DR test was executed in 2009.
The first DR test based on this architecture was executed in 2010. The main goal of this test was to prove that the architecture was capable of a DR scenario and that the engineers were able to execute a failover within the agreed time frames. The test itself was a graceful one: the entire functionality was shut down at one datacenter and activated at our second datacenter. At that time the majority of the applications were active-passive, which proved to be the best starting point. We used this DR test to script and document large parts of the DR.
The 2011 DR test had a totally different character. Knowing that the architecture was capable of a DR, and knowing that a real DR would not be graceful at all, we decided to take a more drastic approach. This time we needed to test how resilient the environment really was if we lost datacenters the hard way. In addition, we had implemented sync mirror storage, so some of the applications ran active-active. This also gave us the opportunity to run primarily active in a datacenter other than Schuberg Philis. This architecture change needed to be tested as well. During the test we shut down power in the racks of Datacenter 2 and Datacenter 3. This led to a massive chain of events that needed attention and recovery. Over 1,400 Nagios alerts make you fall back on your knowledge and experience: make the environment reliable again by checking and fixing network, storage, and virtualization, and only then focus on the applications.
In November 2012 we took the approach described above. The Game Day scenario itself was a copy of an event that happened earlier, in 2011, but in this test we exaggerated the event, making it a good excuse to execute a real failover. The scenario was the following:
17:35 | There is a fire in the Xxxxx building next door. The fire started around 17:25. We have been alerted by people in the SBP building about smoke in the streets and also inside the building itself.
17:55 | The Fire Department came by to instruct people to leave the building, as the fire is not yet under control. We can see the flames sky high through the roof. The smoke is getting more intense. Our Data Center Manager is alerted by security: the smoke detection is at a dangerous level. Only two levels up and the fire suppression will be triggered. This means that we need to shut down our datacenter as fast as possible.
17:58 | The internal emergency response team (in Dutch: Bedrijfshulpverlening) is evacuating the building. The Director of Operations is talking to the fire department because SBP is a 24/7 secured office. Security staff are instructed to leave as soon as the building is evacuated. Police and Fire Department take over physical security of the surrounding area.
17:59 | The Director of Operations calls the Customer Operations Manager for internal IT to decide what to do. They decide that it is best to execute a DR in such a manner that SBP as a datacenter is not needed. The Customer Operations Manager calls the Lead Engineer, who will initiate the DR.
Root cause for success and findings
The Game Day exercise itself was executed successfully. We encountered no major findings. This gives us great comfort that we will be up and running fast in case of a real disaster or act of God. However, a number of minor findings need to be taken care of.
First of all, the items that were executed successfully:
- The total time for the DR and the failback was 2.5 hours. In this time frame we did a manual failover, we checked the entire architecture (storage, network, virtualization layer, application layer), and on top of that we tested all functionality. After a successful failover we agreed to fail back as soon as possible. The major functionality our engineers use to service customers (connectivity, documentation, procedures, ticketing system, passwords) was only interrupted briefly, as anticipated.
- As all engineers were commuting or had to leave the building, it was good to see that a core team of engineers arranged a working spot within 10 minutes of the start of the test. The DR itself was therefore started promptly after the evacuation. In a test scenario we had more time to start; in a real disaster this might not be the case. So it is good to see that we can start faster than anticipated.
- The link between the DR and the BCP was tested and proved to be working. The following BCP steps were executed:
- A. Identify and communicate the incident
- B. Convene the CMT (Crisis Management Team)
- C. Start the McInfra DR procedure
- D. Execute the evacuation
- E. Align with the customer and customer team on the DR
- Of course the scope was limited to the Schuberg Philis environment only.
The minor and medium findings that need adjustment are:
- Minor – Setting up a conference call with your mobile phone while also needing that phone to call individual members does not work seamlessly; you need a second phone next to the conference call. Setting up a conference bridge in the hotel where we found ourselves a working spot proved not to work either. It is preferable to have sufficient conference bridges with both an online and a phone connection. In addition, it is preferable to have online conferencing facilities that are known by all staff (WebEx, GoToMeeting, and the like).
- Medium – Last year we decided to split the communication between the Crisis Management Team and the Schuberg Philis DR test. This is still not working optimally. Calling all four engineers on duty and eight customer operations managers takes too much time: experience shows that each call takes up to three minutes, which means the Schuberg Philis Customer Operations Manager spends almost 45 minutes on the phone at the start of the DR. As the status changes quickly in the beginning, the Customer Operations Manager also needs to align communication with the team that executes the DR test. A second Customer Operations Manager will be appointed to share this communication.
- Minor – Not all drawings (rack diagrams) were available in PDF format. The Visio format is too big for working from a remote location. A fix is easy to implement.
- Minor – The failover of the Certificate Authority is not documented. The failover was executed successfully, but only because the engineer knew exactly what to do by heart. This will be documented in a SOP (Standard Operating Procedure).
- Minor – The use of the SQL query to fail over the ERP system was not described sufficiently. This will be documented in a SOP (Standard Operating Procedure).
- Minor – The Load Balancer configurations were not synced. We need to verify the configuration sync of the load balancers.
- Minor – The order of communication was not followed correctly by the Customer Operations Manager. On his way to the hotel, the Customer Operations Manager called the other Customer Operations Managers, whereas he should have called the Engineers on Duty first. However, as this is a rotating group, it is not clear who the Engineers on Duty are without looking in the pager duty tool (IRT). As a possible mitigation we could assign standard phone numbers that are switched automatically when a duty is handed over from one person to another.
- Minor – A monitoring check was disabled during maintenance and never re-enabled. Who monitors the monitor?
- Minor – Not all Customer Operations Managers could be contacted directly (holidays and such). As a result we decided to call the lead engineer of the team instead. This is not described in the procedure.
- Minor – The storage username and password were not located in the password safe. This needs to be added.
- Minor – SBP Citrix servers could not be drained without the Citrix tools. Those tools will be installed on the management servers.
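The unsynced load balancer finding above can be caught routinely by diffing exported configurations. A minimal sketch, assuming the configurations can be dumped as plain text; the sample configs, field names, and file labels are hypothetical:

```python
import difflib

def config_diff(primary: str, secondary: str) -> list[str]:
    """Return unified-diff lines between two load balancer config dumps."""
    return list(difflib.unified_diff(
        primary.splitlines(), secondary.splitlines(),
        fromfile="lb-primary", tofile="lb-secondary", lineterm=""))

# Hypothetical config dumps; in practice these would be exported
# from each load balancer before comparing.
lb1 = "vip 10.0.0.1:443\npool web01 web02\ntimeout 30\n"
lb2 = "vip 10.0.0.1:443\npool web01 web02\ntimeout 60\n"

drift = config_diff(lb1, lb2)
if drift:
    print("Load balancer configs out of sync:")
    print("\n".join(drift))
```

Run periodically (or before a DR test), an empty diff confirms the pair is in sync, while any output pinpoints the drifted lines.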
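The "who monitors the monitor?" finding suggests a periodic meta-check for checks left disabled after maintenance. A minimal sketch that scans a Nagios status.dat dump for services with active checks disabled; the sample excerpt is hypothetical and a real status.dat contains many more fields:

```python
import re

def disabled_checks(status_dat: str) -> list[str]:
    """Scan a Nagios status.dat dump for service checks left disabled."""
    found = []
    for block in re.findall(r"servicestatus\s*\{(.*?)\}", status_dat, re.S):
        fields = dict(
            line.strip().split("=", 1)
            for line in block.strip().splitlines() if "=" in line
        )
        if fields.get("active_checks_enabled") == "0":
            found.append(
                f"{fields.get('host_name')}/{fields.get('service_description')}")
    return found

# Hypothetical excerpt of a Nagios status.dat file.
sample = """
servicestatus {
    host_name=web01
    service_description=HTTP
    active_checks_enabled=1
}
servicestatus {
    host_name=db01
    service_description=Replication
    active_checks_enabled=0
}
"""
print(disabled_checks(sample))  # → ['db01/Replication']
```

Scheduled daily, a non-empty result list becomes an alert of its own, so a check disabled during maintenance cannot stay silent indefinitely.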