ORL-FL | Network Latency
Incident Report for Atlantic.Net
Postmortem

Overview:  

On May third and fourth, a problem affecting virtual machines on cloud computing nodes caused intermittent connectivity issues overnight. During this incident, many virtual machines in Orlando experienced packet drops and were unable to connect to the internet. Not all machines were affected, even within the same computing node; however, our monitoring did alert us to the degradation in performance.  

On May third:  

Initial monitoring alerts came in around 1:10AM EDT, and subsequent alerts at 1:20AM EDT indicated a broader issue affecting multiple nodes. Our cloud team was informed of these events, and the decision was made to bring in members of the cloud R&D team to assist with troubleshooting. Our engineers noted that some virtual machines were missing network security filters, which kept network traffic within the nodes from flowing properly and caused occasional interruptions. Our cloud team linked the missing network security filters to a single computing node. Once the networking rules were updated, connectivity was restored, and our monitoring indicated that the issues were resolved at 6:00AM EDT that same day. A set of system optimizations for the node was also applied during the update of the networking rules.  

On May fourth:  

At around the same time the next day, at 1:07AM EDT, a similar alert was raised, this time from another computing node. Our cloud team (including R&D) was again notified immediately and began investigating. The team identified more virtual machines with missing networking rules and determined that this was the cause of the network latency and dropped packets. They began applying the same networking rules and optimizations to multiple nodes in Orlando and completed this at around 5:00AM EDT. On further inspection, they also found three servers that were configured to support high availability but had no networking rules. Once the networking rules were corrected on these last three machines, the issues were resolved at 6:24AM EDT.  
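
As an illustration only, the kind of check described above (finding virtual machines that have no networking rules, grouped by computing node) could be sketched as follows. This is a minimal sketch and not the tooling our cloud team used: the `VirtualMachine` record, its field names, the sample inventory, and the filter name `base-filter` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    """Hypothetical inventory record; every field name here is an assumption."""
    name: str
    node: str
    network_filters: list[str] = field(default_factory=list)
    high_availability: bool = False


def find_unfiltered_vms(inventory):
    """Return {node: [vm names]} for every VM with no network filter attached.

    High-availability machines are deliberately not skipped, so they are
    audited in the same pass as every other virtual machine.
    """
    missing = defaultdict(list)
    for vm in inventory:
        if not vm.network_filters:
            missing[vm.node].append(vm.name)
    return dict(missing)


# Made-up sample data: two nodes, each with one VM lacking filters.
inventory = [
    VirtualMachine("vm-a", "node-1", ["base-filter"]),
    VirtualMachine("vm-b", "node-1"),
    VirtualMachine("vm-c", "node-2", high_availability=True),
    VirtualMachine("vm-d", "node-2", ["base-filter"]),
]

report = find_unfiltered_vms(inventory)
for node, names in sorted(report.items()):
    print(f"{node}: missing filters on {', '.join(names)}")
```

Grouping the result by node reflects the observation that affected and unaffected machines could share the same computing node, so a per-node view shows where rules need to be reapplied.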

Conclusion:  

This event prompted us to begin a review of all our environments and datacenters to make sure this issue does not arise elsewhere. We apologize for the disruption in service and are committed to ensuring that this incident does not recur.

Posted May 08, 2024 - 14:58 EDT

Resolved
This incident has now been resolved, and we are no longer seeing any recurring issues. Our cloud engineers will again be on standby throughout the day and the coming night to ensure that no further irregularities arise on our cloud platform.
Posted May 04, 2024 - 10:58 EDT
Update
We have identified and addressed the root cause of the increased network latency. All customer environments in the Orlando data center are now functioning normally. Our team will keep monitoring the network to ensure that latency remains at normal levels.
Posted May 04, 2024 - 09:01 EDT
Monitoring
The situation has improved. We will continue to work towards a resolution.
Posted May 04, 2024 - 06:27 EDT
Identified
We are in the process of implementing a new solution.
Posted May 04, 2024 - 04:48 EDT
Investigating
The issue has not improved and is still being investigated.
Posted May 04, 2024 - 03:51 EDT
Monitoring
The cause of increased network latency has been identified. Our team has implemented the necessary fixes to address the root cause. We will monitor the network closely to ensure that latency returns to normal levels. Further updates will be provided as the situation progresses.
Posted May 04, 2024 - 03:27 EDT
Identified
The cause of increased network latency has been identified. Our team is working on implementing the necessary fixes to address the root cause. We will monitor the network closely to ensure that latency returns to normal levels as the fix is applied. Further updates will be provided as the situation progresses.
Posted May 04, 2024 - 02:51 EDT
Investigating
Our monitoring systems have notified us of potential network latency issues for this data center. Our engineers are currently investigating this issue.
Posted May 04, 2024 - 01:57 EDT
This incident affected: Regions (Orlando, Florida (USA-East-1)).