Here's the promised complete post mortem of everything that happened.

What was planned:
Do a backup of our VMs and a software upgrade of all machines on a Saturday. I estimated about 2 hours for that; xes corrected my estimate to 6.

It turned into 10 days of frantic firefighting, colliding with day jobs, family and giving talks at conferences.

What happened:

Saturday, 19.11.2016
10:00 - start updates and backups on blade-a
14:30 - backups and updates complete on blade-a, reboot confirmed successful
14:31 - uptime induced filesystem check after 1347 days
15:00 - start of backups on blade-b
17:12 - filesystem check complete, blade-a up and running
17:30 - first systems on blade-a confirmed up and working
18:30 - software upgrade on stage and mail complete
20:15 - backups of blade-b finished and copied onto blade-a backup space
20:16 - start of updates on blade-b
21:00 - updates on blade-b complete, reboot
21:01 - blade-b stuck in boot with corrupt bios image in flash
23:30 - all available remote recovery options tried, none working
23:40 - decision to go for Plan B, boot talk.maemo.org on blade-a, redirect everything else to talk.m.o
23:45 - blade-b turned off through IPMI
23:53 - talk.m.o available again

Monday, 21.11.2016
16:00 - Datacenter visit, trying to boot blade-b with attached USB key for BIOS recovery
18:00 - No recovery possible, board hangs with "A9" after attaching USB devices, decision to swap the board. Unable to swap the board directly because the hardware has to be powered off and removed from the rack, and there are APC power strips in the way.

Tuesday, 22.11.2016
22:30 - Starting backup of stage.m.o

Wednesday, 23.11.2016
21:30 - stage.m.o available again

Saturday, 26.11.2016
16:00 - Second datacenter visit, security guard doesn't show up until 16:45
16:45 - Swapping CPU and Memory to a spare board in the chassis
17:30 - Powering up spare board, removing the udev rules, thereby accidentally swapping the network interfaces of blade-b
19:30 - Upgrade of both blades to latest kernel and Xen version with security patches
21:40 - Correction of interface udev rules to match interface names to physical interfaces
22:20 - Everything considered to be fully operational, applying finishing touches

Sunday, 27.11.2016
00:20 - maemo.org infrastructure declared operational

Monday, 28.11.2016
05:22 - blade-b kernel: NETDEV WATCHDOG: eth1 (igb): transmit queue 3 timed out
Network interfaces of blade-b started to reset due to bogon emissions
Affected systems: www, wiki, garage, builder, vcs
19:20 - reload of igb kernel modules fixed the condition, everything working again.

Tuesday, 29.11.2016
05:22 - blade-b has the same error as on Monday
15:20 - fixed by reload of igb kernel module
15:40 - reboot of blade-b, disabled ASPM
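
The watchdog message from 28.11. is regular enough to be matched mechanically, which could help spot a repeat before users do. A minimal sketch in Python, assuming the message always looks like the single line quoted above (other kernel versions may word it differently). The function name and the second sample line are made up for illustration and are not part of our actual setup:

```python
import re

# Pattern modelled on the one log line quoted in the timeline;
# this is an assumption, not a complete description of kernel output.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(lines):
    """Return (interface, driver, queue) for every watchdog timeout line."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(
                (match.group("iface"), match.group("driver"), int(match.group("queue")))
            )
    return hits


sample = [
    "blade-b kernel: NETDEV WATCHDOG: eth1 (igb): transmit queue 3 timed out",
    "blade-b kernel: some unrelated message",  # hypothetical filler line
]
print(find_watchdog_timeouts(sample))
```

Fed with lines from the kernel log, a non-empty result would be the cue to look at the affected interface and driver.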

Total time spent staring at screens and in the colocation: 60+ hours

Time without serious hiccups: 1234 days (which was the uptime of both blade-a and blade-b)

So if you want to thank your admins, here are our wishlists:

xes - (prefers giftcards from amazon.it to xes.maemo (at) gmail.com) https://www.amazon.it/dp/B005VG4G3U
falk - https://www.amazon.de/gp/registry/wishlist/168D3W6163KG

Best,

xes & falk
 
