On May 3rd and 4th, problems with virtual machines on our cloud computing nodes caused intermittent connectivity issues overnight. During this incident, many virtual machines in Orlando experienced packet drops and were unable to connect to the internet. Not all machines were affected, even within the same computing node; however, our monitoring did alert us to a degradation in performance.
Initial monitoring alerts came in around 1:10AM EST, and subsequent alerts at 1:20AM EST indicated a broader issue affecting multiple nodes. Our cloud team had already been notified of these events, and the decision was made to bring in members of the cloud R&D team to assist with troubleshooting. Our engineers noted that some virtual machines were missing network security filters, which was disrupting traffic flow within the nodes and causing occasional interruptions. The cloud team traced the missing network security filters to a single computing node, and once the networking rules were updated, connectivity was restored; our monitoring indicated the issue was resolved at 6:00AM EST that same day. During the update of the networking rules, a set of system optimizations for the node was also applied.
At around the same time the next day, at 1:07AM EST, a similar alert was raised, this time by another computing node. Our cloud team (including R&D) was once again immediately notified and began investigating. They identified more virtual machines with missing networking rules, which was determined to be the cause of the network latency and dropped packets. They began applying the same networking rules and optimizations to multiple nodes in Orlando and completed this work at around 5:00AM EST. On further inspection, they also found three servers that were configured for high availability but had no networking rules. Once the networking rules were corrected on these last three machines, the issues were resolved at 6:24AM EST.
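To illustrate the kind of check involved, the following is a minimal sketch of how virtual machines missing expected network security filters could be flagged on a node. The data model and names used here (Vm, REQUIRED_FILTERS, the example node and VM names) are hypothetical and do not reflect our production tooling or actual rule names.

```python
from dataclasses import dataclass, field

# Hypothetical representation of a virtual machine and the network
# security filters (rule sets) attached to its virtual interface.
@dataclass
class Vm:
    name: str
    node: str
    security_filters: set[str] = field(default_factory=set)

# Baseline filters every VM on a node is expected to carry.
# These names are illustrative only.
REQUIRED_FILTERS = {"default-ingress", "default-egress", "anti-spoof"}

def find_unprotected_vms(vms: list[Vm]) -> list[Vm]:
    """Return VMs missing one or more of the required security filters."""
    return [vm for vm in vms if not REQUIRED_FILTERS <= vm.security_filters]

if __name__ == "__main__":
    # Hypothetical inventory for demonstration purposes.
    inventory = [
        Vm("web-01", "orlando-node-3", {"default-ingress", "default-egress", "anti-spoof"}),
        Vm("web-02", "orlando-node-3", {"default-ingress"}),  # partially configured
        Vm("db-01", "orlando-node-7", set()),                 # no filters at all
    ]
    for vm in find_unprotected_vms(inventory):
        missing = REQUIRED_FILTERS - vm.security_filters
        print(f"{vm.node}/{vm.name} is missing filters: {sorted(missing)}")
```

An audit along these lines, run against the full inventory rather than a single node, is the sort of check described in the review below.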
This event prompted us to begin a review of all our environments and datacenters to make sure this issue does not arise elsewhere. We apologize for the disruption in service and are committed to ensuring this incident does not recur.