This blog post is about a real-life, local network incident that happened several days ago. It is an excellent example of how a perfect storm of small things can cause a network outage. It is also an excellent example of how useful the public BGP record, archived as MRT files, can be for reconstructing what happened.
AS53443 was turning up a second BGP transit provider, and during the turnup process leaked routes from their first provider (AS6327) to the second provider (AS7122). The second provider accepted these routes and installed them. As a result, from a BGP perspective, most of “the internet” (535,292 IPv4 prefixes out of roughly 760,000, or about 70%) was reachable outbound through a customer link for about 13 minutes, and then again 25 minutes later for a further 3 minutes (295,143 prefixes this time). Since this customer link is much smaller than the backbone links, this caused extreme outbound congestion and basically took AS7122 off the internet.
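As a quick check of the "about 70%" figure, the arithmetic can be sketched in Python. The 760,000 table size is the rough number quoted above, not a measured value, and the function name is made up for this example.

```python
# Approximate IPv4 table size quoted in the post; the real size varies by vantage point.
TOTAL_IPV4_PREFIXES = 760_000


def leaked_share(leaked: int, total: int = TOTAL_IPV4_PREFIXES) -> float:
    """Return the leaked prefix count as a percentage of the table."""
    return 100.0 * leaked / total


first = leaked_share(535_292)   # first leak
second = leaked_share(295_143)  # second leak

print(f"first leak:  {first:.1f}% of the table")
print(f"second leak: {second:.1f}% of the table")
```

By the same rough measure, the second, shorter leak covered a bit under 40% of the table.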
AS53443 was turning up BGP with a second provider (AS7122) and likely didn’t have outbound prefix filters, or they weren’t applied as expected.
A couple of points to note here:
Result: DFZ routes from AS6327 leaked to AS7122.
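To make the missing outbound filter concrete, here is a minimal Python sketch of the policy: announce to a transit provider only the prefixes you originate yourself. This is router policy written as plain code for illustration, not AS53443's actual configuration. The `Route` type, the `OWN_PREFIXES` set, the example AS path and the documentation prefixes are all hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: tuple  # ASNs as received; empty for routes we originate ourselves


# Hypothetical: the only prefix this network originates (a documentation range).
OWN_PREFIXES = {ip_network("192.0.2.0/24")}


def export_to_provider(routes):
    """Keep only locally originated routes that are in our own prefix list.

    Anything learned from another provider has a non-empty AS path and is
    dropped, which is what prevents a full-table leak between providers.
    """
    return [
        r for r in routes
        if not r.as_path and ip_network(r.prefix) in OWN_PREFIXES
    ]


table = [
    Route("192.0.2.0/24", ()),                # our own prefix
    Route("198.51.100.0/24", (6327, 64500)),  # learned from the first provider
    Route("203.0.113.0/24", (6327,)),         # learned from the first provider
]
exported = export_to_provider(table)
print([r.prefix for r in exported])
```

A network with BGP customers of its own would also need to export its customers' routes; this sketch assumes there are none.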
Because no prefix filter was in place on its side of the session, AS7122 accepted the leaked routes, a large number of which had shorter AS-PATHs via AS53443 through AS6327. It then installed routes to roughly 70% of the internet outbound through this customer link to AS53443.
A couple of points to note here:
53443 6327 due to their shorter path (hard to tell externally)
Result: AS7122 had outbound congestion to roughly 70% of “the internet” (by prefix count)
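On the receiving side, two common safeguards are a per-customer prefix list and a maximum-prefix limit. The Python sketch below illustrates both under stated assumptions: `CUSTOMER_PREFIXES`, `MAX_PREFIXES` and the function name are hypothetical, and real routers express this as session configuration rather than code like this.

```python
from ipaddress import ip_network

# Hypothetical: the prefixes this customer is allowed to announce (IPv4 only here).
CUSTOMER_PREFIXES = [ip_network("192.0.2.0/24")]
# Hypothetical limit; a small customer should never send more than a handful.
MAX_PREFIXES = 10


def accept_from_customer(announced):
    """Return (accepted_prefixes, session_up) for a batch of announcements.

    If the customer sends more prefixes than the limit, treat it as a leak,
    accept nothing and report the session as torn down. Otherwise accept
    only prefixes that fall inside the customer's registered ranges.
    """
    if len(announced) > MAX_PREFIXES:
        return [], False
    accepted = [
        p for p in announced
        if any(ip_network(p).subnet_of(allowed) for allowed in CUSTOMER_PREFIXES)
    ]
    return accepted, True


# Normal case: the customer's own prefix passes, the stray one is dropped.
print(accept_from_customer(["192.0.2.0/24", "198.51.100.0/24"]))

# Leak case: far too many prefixes, so nothing is installed.
print(accept_from_customer(["203.0.113.0/24"] * (MAX_PREFIXES + 1)))
```

Either check alone would have kept a leaked table of this size out of the forwarding path in this sketch.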
Start Time: GMT: Saturday, April 20, 2019 9:43:13 PM
End Time: GMT: Saturday, April 20, 2019 9:56:12 PM
Second Start: GMT: Saturday, April 20, 2019 10:21:04 PM
Second End: GMT: Saturday, April 20, 2019 10:24:10 PM
These times may not be exact, as they’re observed via BGP, but they’re pretty close.
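The durations quoted earlier follow from these timestamps. A small Python sketch of the arithmetic, using only the four times listed above (the variable names are made up for this example):

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Timestamps as observed via BGP, April 20, 2019 (GMT).
first_start = datetime(2019, 4, 20, 21, 43, 13, tzinfo=UTC)
first_end = datetime(2019, 4, 20, 21, 56, 12, tzinfo=UTC)
second_start = datetime(2019, 4, 20, 22, 21, 4, tzinfo=UTC)
second_end = datetime(2019, 4, 20, 22, 24, 10, tzinfo=UTC)

first_leak = first_end - first_start     # about 13 minutes
gap = second_start - first_end           # about 25 minutes
second_leak = second_end - second_start  # about 3 minutes

print("first leak: ", first_leak)
print("gap:        ", gap)
print("second leak:", second_leak)
```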