Due to an engineering error last Monday, portions of the Google Cloud lost customer connectivity last Monday for approximately 70 minutes after Google network engineers manually connected a new peering link, bypassing the system of automatic checks that validate such links when proper procedures are followed.
The error made the europe-west1 region Google Compute Engine unreachable from a subset of destinations, primarily in Eastern Europe and the Middle East. The issue was strictly with the network, not affecting Compute Engine instances in the same region in other locations. Traffic strictly within the Google network was also unaffected.
The problem was caused by the addition of a new link to a global peer with whom Google was already connected. The engineers brought the link up manually, not realizing that the link would advertise far more capacity than was actually available. Network systems automatically routed traffic to the new, seemingly high capacity link, and four minutes after the link was created it was saturated and started dropping the majority of the network traffic routed through the link.
The process was done manually because the automation that would normally have handled the link and its associated safety checks was down, according to Google, due to an unrelated failure. This automation is expected to protect the network from problems such as the one that happened for one hour. Due to the automation issue, the problem was not discovered for 61 minutes because the post-activation checks that would normally have been performed during that hour were not available and the problem was discovered when the normal system monitoring took over.
To prevent this specific problem from recurring Google is changing the operations policy and no longer allowing these links to be brought up manually. In the future, the automation system needs to be fully operational before additional links will be added.
Leave a Reply