On August 30, 2020, Level 3/CenturyLink (AS 3356) had a major Internet outage. The outage affected a massive number of networks, including very well-known ones such as Amazon, Microsoft, Twitter, Discord and Reddit. 3.5% of global Internet traffic was dropped because of this outage, and the entire network took almost 7 hours to converge.
This is a huge amount of time. When we discuss convergence, and fast convergence in particular, the target values are seconds, if not milliseconds. No one wants network convergence measured in minutes.
When there is an outage like this, we categorize it as a 'catastrophic failure', and unfortunately network design usually doesn't take this kind of failure into account. But could it have been prevented? First, let's understand that this event, like many other catastrophic network events, started at a single location. (According to a CenturyLink status page, the issue originated from CenturyLink's data center in Mississauga, a city in Ontario, Canada.)
But it spread over the entire backbone of AS 3356. In fact, as I remember it, the 2014 event we famously know as the 512k incident also happened because of this network (Level 3), and that event caused global outages too. (The Default Free Zone/global routing table exceeded 512,000 IPv4 unicast prefixes; Level 3 announced 30k prefixes, and this was one of the route leak types defined in RFC 7908.) The August 30, 2020 issue was not a route leak, though.
It happened because of a bad Flowspec (RFC 5575) rule. Flowspec-based outages have happened many times in the past, and the Cloudflare Flowspec-based outage is a famous one. Flowspec is an extension to BGP that allows companies to use BGP routes to distribute firewall/policy rules across their network.
Flowspec announcements are usually used when dealing with security incidents, such as BGP hijacks or DDoS attacks, because they allow companies to change their entire network to react to and mitigate attacks within seconds. I usually explain it as the more flexible version of RTBH (Remotely Triggered Black Hole), though there are many differences between the two.
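To make the "more flexible RTBH" comparison concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the Packet and FlowRule classes and the is_dropped helper are hypothetical names, not a real BGP implementation, and the rule only models a few of the match fields Flowspec supports. The addresses are documentation prefixes.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    protocol: str   # "tcp", "udp", ...
    dst_port: int


@dataclass(frozen=True)
class FlowRule:
    """Simplified Flowspec-style rule (hypothetical model): unset fields match anything."""
    dst_prefix: str
    src_prefix: Optional[str] = None
    protocol: Optional[str] = None
    dst_port: Optional[int] = None
    action: str = "drop"

    def matches(self, pkt: Packet) -> bool:
        if ip_address(pkt.dst) not in ip_network(self.dst_prefix):
            return False
        if self.src_prefix is not None and ip_address(pkt.src) not in ip_network(self.src_prefix):
            return False
        if self.protocol is not None and pkt.protocol != self.protocol:
            return False
        if self.dst_port is not None and pkt.dst_port != self.dst_port:
            return False
        return True


def rtbh_rule(victim: str) -> FlowRule:
    """RTBH can only express 'drop everything towards this destination'."""
    return FlowRule(dst_prefix=victim)


def is_dropped(rule: FlowRule, pkt: Packet) -> bool:
    return rule.action == "drop" and rule.matches(pkt)


victim = "203.0.113.10/32"
rtbh = rtbh_rule(victim)
# Flowspec-style rule: drop only UDP traffic to port 123 towards the victim.
flowspec = FlowRule(dst_prefix=victim, protocol="udp", dst_port=123)

attack = Packet(src="198.51.100.7", dst="203.0.113.10", protocol="udp", dst_port=123)
legit = Packet(src="192.0.2.20", dst="203.0.113.10", protocol="tcp", dst_port=443)

print(is_dropped(rtbh, attack), is_dropped(rtbh, legit))          # True True
print(is_dropped(flowspec, attack), is_dropped(flowspec, legit))  # True False
```

With RTBH the victim is taken offline for everyone, which completes the attack on the attacker's behalf. The Flowspec-style rule drops only the attack traffic, and that same expressiveness is what makes a wrong rule so dangerous.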
Although service providers let customers control their own DDoS prevention with RTBH, they don't like to hand over control via Flowspec. For me, Flowspec is no different from VTP or any other protocol whose potential risks are greater than its benefits. You can bring the entire network down relatively easily, and what you get in the tradeoff is operational simplicity.
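One way to reduce that risk is to sanity-check a rule before it is ever announced. The sketch below is an assumption of mine, not something CenturyLink or any vendor is known to run: ProposedRule, validate and the MIN_PREFIX_LEN_V4 threshold are hypothetical. It rejects two classes of dangerous rule: one whose destination is very broad, and one that could drop TCP port 179 (BGP) traffic towards the operator's own infrastructure prefixes.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import List, Optional

BGP_PORT = 179            # BGP runs over TCP port 179
MIN_PREFIX_LEN_V4 = 24    # hypothetical policy threshold, not a standard value


@dataclass(frozen=True)
class ProposedRule:
    dst_prefix: str
    protocol: Optional[str] = None   # None means "any protocol"
    dst_port: Optional[int] = None   # None means "any port"
    action: str = "drop"


def validate(rule: ProposedRule, infrastructure: List[str]) -> List[str]:
    """Return the reasons to reject the rule; an empty list means it passed."""
    problems = []
    target = ip_network(rule.dst_prefix)

    # Check 1: refuse very broad IPv4 destinations.
    if target.version == 4 and target.prefixlen < MIN_PREFIX_LEN_V4:
        problems.append(f"destination {target} is broader than /{MIN_PREFIX_LEN_V4}")

    # Check 2: refuse drop rules that could match BGP towards our own routers.
    could_hit_bgp = rule.protocol in (None, "tcp") and rule.dst_port in (None, BGP_PORT)
    for block in infrastructure:
        infra = ip_network(block)
        if infra.version != target.version:
            continue
        if rule.action == "drop" and could_hit_bgp and target.overlaps(infra):
            problems.append(f"rule could drop BGP sessions inside {block}")
    return problems


infra = ["192.0.2.0/24"]   # documentation prefix standing in for router loopbacks
too_broad = ProposedRule(dst_prefix="0.0.0.0/0")
narrow = ProposedRule(dst_prefix="203.0.113.10/32", protocol="udp", dst_port=123)

print(validate(too_broad, infra))   # two reasons to reject
print(validate(narrow, infra))      # []
```

A check like this costs some of the operational simplicity that makes Flowspec attractive, which is exactly the tradeoff described above.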
CenturyLink/Level 3 coordinated with the other providers, who met on IRC to discuss what to do, and the other providers simply de-peered (disabled) their BGP sessions with AS 3356 until the problem was resolved. So, one more time, we have seen how easily a bad Flowspec rule can create an AS (Autonomous System)-wide outage, and when that AS belongs to one of the global Tier 1 companies, it can affect millions of people.
I hope this time we learn the lesson and take routing security seriously! BGP security, interdomain routing, BGP traffic engineering, ingress and egress peer engineering and many other BGP details can be found in my BGP Zero to Hero course. Cheers, Orhan Ergun