This blog post is about a real-life, local network incident that happened several days ago. It is an excellent example of how a perfect storm of small things can cause a network outage, and of how useful the public BGP record kept in MRT files can be for working out what happened.
AS53443 was turning up a second BGP transit provider and, during the turn-up, leaked routes from its first provider to the second. The second provider accepted and installed these routes, which made most of “the internet” from a BGP perspective (535,292 IPv4 prefixes out of roughly 760,000, or about 70%) outbound-reachable through a customer link for about 13 minutes, and then again about 25 minutes later for a further 3 minutes (295,143 prefixes this time). Because this customer link is much smaller than the backbone links, the result was extreme outbound congestion that basically took AS7122 off the internet.
The Perfect Storm
AS53443 was turning up BGP with a second provider (AS7122) and likely didn’t have outbound prefix filters, or they weren’t applied as expected.
A couple of points to note here:
- A lack of filters wouldn’t be noticed with a single provider, because BGP loop prevention makes that provider discard its own routes when they are sent back to it.
- Certain network operating systems (Cisco IOS, for one) make it difficult to enable prefix filters before the session comes up.
- The net result was that routes from AS6327 were leaked to AS7122 through AS53443.
- AS53443 is new to the internet this year, and this is a very easy mistake to make when turning up a second BGP provider for the first time.
Result: DFZ routes from AS6327 leaked to AS7122.
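To make the missing piece concrete, here is a minimal sketch of what an outbound filter toward a transit provider does. It is only an illustration: the prefixes are documentation ranges, the downstream AS numbers (64500, 64501) are reserved documentation ASNs, and the function name `export_to_provider` is hypothetical. None of it is AS53443's actual configuration.

```python
from ipaddress import ip_network

# Hypothetical prefixes the customer network originates itself (documentation ranges).
OWN_PREFIXES = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]


def export_to_provider(routes, own_prefixes=OWN_PREFIXES):
    """Return only the routes that may be announced to a transit provider."""
    allowed = set(own_prefixes)
    return [route for route in routes if route["prefix"] in allowed]


# Local table: one self-originated prefix and two routes learned from the first provider.
rib = [
    {"prefix": ip_network("192.0.2.0/24"), "as_path": []},
    {"prefix": ip_network("203.0.113.0/24"), "as_path": [6327, 64500]},
    {"prefix": ip_network("198.18.0.0/15"), "as_path": [6327, 64501]},
]

unfiltered = rib                    # no outbound filter: everything goes to the second provider
filtered = export_to_provider(rib)  # with the filter: only the network's own prefix is announced
```

Without the filter, the two routes learned from AS6327 are passed along to the second provider, which is the leak described above in miniature.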
Because no prefix filter was in place on the customer session, AS7122 installed routes to roughly 70% of the internet outbound through its customer link to AS53443. From AS7122’s point of view, a large number of routes suddenly had shorter AS-PATHs via AS53443 and then AS6327.
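For illustration, the provider-side protection that was missing can be sketched as an allow-list plus a maximum-prefix limit on the customer session. The allowed prefix, the limit of 10, and the function name `accept_from_customer` are all assumptions made up for this sketch. It also counts prefixes before filtering, which real implementations may or may not do.

```python
from ipaddress import ip_network

# Hypothetical address space the customer is registered to announce.
CUSTOMER_ALLOWED = [ip_network("192.0.2.0/24")]
# Hypothetical session limit; a real value depends on the customer.
MAX_PREFIXES = 10


def accept_from_customer(announced, allowed=CUSTOMER_ALLOWED, max_prefixes=MAX_PREFIXES):
    """Return (accepted_prefixes, session_torn_down) for a batch of announcements."""
    # Far more prefixes than expected: treat it as a leak and drop the session.
    if len(announced) > max_prefixes:
        return [], True
    # Otherwise accept only prefixes that fall inside the customer's allowed space.
    accepted = [p for p in announced if any(p.subnet_of(a) for a in allowed)]
    return accepted, False


normal = [ip_network("192.0.2.0/24"), ip_network("203.0.113.0/24")]
leak = [ip_network(f"10.{i}.0.0/16") for i in range(50)]

normal_result = accept_from_customer(normal)
leak_result = accept_from_customer(leak)
```

Either mechanism on its own would have kept the leaked routes out of AS7122’s routing table in this simplified model.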
A couple of points to note here:
- From a BGP perspective, AS7122 is single-homed behind AS577, which doesn’t publicly peer in Canada. This makes AS7122 artificially further away on the internet from other Canadian networks.
- Related, depending on config, AS7122 may have preferred routes via AS-PATH 53443 6327 due to their shorter path (hard to tell externally).
- Or, also depending on config, AS7122 may have preferred routes from AS53443 due to the BGP customer relationship (hard to tell externally).
- AS7122, being an established telco that has operated BGP in Manitoba under various AS numbers since the ’90s, SHOULD have a well-established customer turnup policy that includes always filtering customer sessions, especially new BGP customers.
Result: AS7122 had outbound congestion to roughly 70% of “the internet” (by prefix count)
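The two possible reasons AS7122 preferred the leaked routes can be sketched with a heavily simplified best-path function. It models only two steps of the real BGP decision process: higher local preference first, then shorter AS-PATH. The local-pref values, the paths, and the documentation AS numbers (64496 to 64500) are assumptions for illustration, since the actual config is not visible externally.

```python
def best_path(candidates):
    """Pick a route: higher local-pref wins, ties break on the shorter AS-PATH."""
    return min(candidates, key=lambda r: (-r["local_pref"], len(r["as_path"])))


# Hypothetical route to some destination via the normal upstream.
via_upstream = {"next_hop": "AS577", "local_pref": 100,
                "as_path": [577, 64496, 64497, 64500]}

# Case 1: same local-pref, but the leaked route has a shorter AS-PATH.
via_leak_short = {"next_hop": "AS53443", "local_pref": 100,
                  "as_path": [53443, 6327, 64500]}
case_shorter_path = best_path([via_upstream, via_leak_short])

# Case 2: customer routes get a higher local-pref, so path length no longer matters.
via_leak_long = {"next_hop": "AS53443", "local_pref": 200,
                 "as_path": [53443, 6327, 64498, 64499, 64500]}
case_customer_pref = best_path([via_upstream, via_leak_long])
```

In both cases the selected next hop is AS53443, which is why either explanation is consistent with what was observed from the outside.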
The MRT Files
Start Time: GMT: Saturday, April 20, 2019 9:43:13 PM
End Time: GMT: Saturday, April 20, 2019 9:56:12 PM
Second Start: GMT: Saturday, April 20, 2019 10:21:04 PM
Second End: GMT: Saturday, April 20, 2019 10:24:10 PM
These times may not be exact, as they’re observed via BGP, but they’re pretty close.
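As a quick sanity check, the durations and the prefix share quoted earlier can be recomputed from these timestamps. This is plain arithmetic on the figures in this post. The 760,000 total is the rough table size used above, not a measured value.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # all times are GMT, as listed above

first_start = datetime.strptime("2019-04-20 21:43:13", FMT)
first_end = datetime.strptime("2019-04-20 21:56:12", FMT)
second_start = datetime.strptime("2019-04-20 22:21:04", FMT)
second_end = datetime.strptime("2019-04-20 22:24:10", FMT)

first_duration = first_end - first_start      # length of the first leak
gap = second_start - first_end                # quiet period between the two leaks
second_duration = second_end - second_start   # length of the second leak

# Share of the (approximate) IPv4 table that was leaked the first time.
share = 535_292 / 760_000

print(first_duration, gap, second_duration, f"{share:.1%}")
```

The first leak comes out just under 13 minutes, the gap just under 25 minutes, and the second leak just over 3 minutes, which matches the rounded figures in the summary.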