On Thursday, February 22nd, starting at approximately 4:56 p.m., most XMission customers experienced an outage lasting at least half an hour, depending on the service. XMission network administrators are still investigating the exact cause, but it was related to ongoing upgrades to our core networking infrastructure.
Note that while most of our fiber customers experienced an outage, it was caused by our DNS servers going offline rather than a loss of connectivity. Connectivity customers configured to use DNS servers elsewhere did not experience any downtime. For the same reason, most of our Colocation customers, who typically have primary or secondary name service set up elsewhere, experienced no issues. Regardless, we recognize that our customers rely heavily on the services we provide, and we do our best to ensure uptime. We are already reviewing ways to make our DNS infrastructure more resilient, and we don't recommend switching to another DNS provider unless you have a specific reason to.
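For customers curious how their own systems would behave in a similar event, below is a minimal sketch, written in Python with the third-party dnspython library, that looks up a hostname through two different resolvers and reports which ones answer. The resolver addresses and hostname in it are placeholders for illustration only, not a recommendation of any particular provider.

# Minimal sketch: compare DNS resolution through two resolvers.
# Requires the third-party dnspython package (pip install dnspython).
# The resolver addresses and hostname below are illustrative placeholders.
import dns.resolver

RESOLVERS = {
    "primary": "198.51.100.10",   # placeholder for your ISP's resolver
    "secondary": "203.0.113.53",  # placeholder for a resolver hosted elsewhere
}
HOSTNAME = "xmission.com"

for label, address in RESOLVERS.items():
    resolver = dns.resolver.Resolver(configure=False)  # ignore the system resolver settings
    resolver.nameservers = [address]
    resolver.lifetime = 3.0  # give up after three seconds
    try:
        answer = resolver.resolve(HOSTNAME, "A")
        ips = ", ".join(record.address for record in answer)
        print(f"{label} ({address}): {HOSTNAME} -> {ips}")
    except Exception as exc:
        print(f"{label} ({address}): lookup failed ({exc})")

If lookups through a secondary resolver succeed while your primary is unreachable, your configuration has the kind of fallback that kept many of our Colocation customers online during this outage.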
Technical Details
As we migrate to a new and much more powerful core router (Juniper MX304), network administrators have been slowly and carefully moving connections off our old core1, starting with deliberately low-stakes connections such as those to our caching servers. Around 4:45 p.m., we began moving the fiber cabling for our management switch, which connects all of our networking equipment and is used only for internal access to that equipment. Note that the new core router had already been routing for this management VLAN for a day without issue.
Unfortunately, after moving the fiber for this management switch to the new router, people started reporting many core services as down (e.g., our DNS servers, mail infrastructure, VoIP, and web hosting). Administrators quickly noticed that the old core router was running a high CPU load on its connection to the switch that handles much of our core server infrastructure. That switch also showed high CPU utilization and was repeatedly freezing up. After we rebooted the problem switch, it came back online and our services slowly began recovering.
Conclusion
Network administrators continue to investigate the root cause of what essentially became a cascading failure. Upgrading core infrastructure always comes with some risk, which we do our best to mitigate by following industry best practices. No further migration to the new core router will take place until we can identify yesterday's trigger and avoid a repeat event. Note that to resolve the issues with the server infrastructure switch, we moved it to the new core router.
These upgrades, including going from 20G to 100G at the Seattle Internet eXchange (SIX) this month and an upcoming peering presence in Denver, are essential to continue providing you, our customers, with the best service possible. We apologize for yesterday's outage, as we know these services are critical to your homes and businesses. We have more exciting improvements coming soon. Thank you for your continued patronage.