BGP PIC Fundamentals: BGP PIC (Prefix Independent Convergence) is a BGP fast reroute mechanism that can provide sub-second convergence, even for the 900K Internet prefixes, by relying on IGP convergence.
BGP PIC uses a hierarchical data plane, in contrast to the flat FIB (Forwarding Information Base) design that is used by Cisco CEF and many legacy platforms.
In a hierarchical data plane, the FIB used by the packet processing engine reflects recursions between the routes.
I will explain the recursion concept throughout the post, so don't worry if the sentence above is not clear yet. It will make sense.
There are two implementations of the BGP PIC concept, and between them they can protect network traffic from several kinds of failure.
A link or node failure in the core or at the edge of the network can be recovered from in under a second, and in most cases in under 100 ms. (It mostly depends on IGP convergence, so the IGP should be tuned or IGP FRR should be used.)
In this article, I will not explain IGP fast convergence or IGP fast reroute; I cover those in my Fast Reroute mechanisms article.
BGP PIC can be thought of as a BGP fast reroute mechanism that relies on IGP convergence for failure detection. (All overlay protocols rely on the convergence of their underlay protocol, e.g. LDP/IGP synchronization, STP/HSRP, IGP/BGP, IP/GRE, and so on.)
As I mentioned above, there are two implementations of BGP PIC, namely BGP PIC Edge and BGP PIC Core. Let's start with BGP PIC Core.
BGP PIC CORE
In the above figure R1, R2, R3, R4, and R5 belong to AS 100, and R6 and R7 belong to AS 200.
There are two EBGP connections between ASBRs of the Service Providers.
So far, everybody has told you that BGP is slow because BGP is used for scalability in networks, not for fast convergence, right?
But that is not quite right. Or at least it is not enough to understand how BGP converges!
If BGP relies on the control plane to converge, of course it will be slow. The default timers are long (BGP MRAI, BGP Scanner, and so on, although you don't need to rely on them, as I will explain now), and there are too many prefixes and too much path information for the best path selection algorithm to quickly select the second-best path to advertise when the primary path fails.
The default-free zone already has more than 900K prefixes, so we are talking about approximately 100 MB of data from each neighbor, and transferring that takes time too. If you have multiple paths, the amount of data that needs to be sent will be much higher.
Let's look at BGP control plane convergence more closely.
Imagine that R1 in the above picture learns the 5.5.5.5 prefix from R4 only, so R4 is the next hop. (Maybe you chose R4 as the primary exit with BGP Local Preference or MED, or maybe you did nothing and R4 was selected by hot potato routing because of the Route Reflector's position.)
If R4 is the best path, R5 doesn't advertise the 5.5.5.5 prefix into the IBGP domain unless BGP Best External is enabled. (I highly recommend enabling it if you want additional path information in an active/standby link design.)
How do the IBGP routers learn that R4 has failed?
There are two mechanisms for that. They either wait for the BGP Scanner timer (60 seconds in most implementations) to check whether the BGP next hops for the BGP prefixes are still up, or they use the newer approach, BGP Next-Hop Tracking (almost all vendors support it). With BGP Next-Hop Tracking, the BGP next-hop prefixes are registered with the IGP route watch process, so as soon as the IGP detects the BGP next-hop failure, BGP is informed.
It is similar to the way BGP, the IGP, and LDP register with BFD, right? Good!
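To make the registration idea concrete, here is a minimal Python sketch of it. It is only a toy model: the `RouteWatch` class, its `register` and `igp_route_lost` methods, and the callback are names I made up for illustration, not any vendor's API. I am also assuming that 10.0.0.1 and 10.0.0.2 are the next-hop addresses of R4 and R5.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class RouteWatch:
    """Toy IGP route-watch process (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # next-hop address -> callbacks of the clients that registered for it
        self._watchers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def register(self, next_hop: str, callback: Callable[[str], None]) -> None:
        """A client (here BGP) asks to be told when next_hop becomes unreachable."""
        self._watchers[next_hop].append(callback)

    def igp_route_lost(self, next_hop: str) -> None:
        """Called when the IGP loses its route to next_hop: notify clients at once."""
        for callback in self._watchers.get(next_hop, []):
            callback(next_hop)


# BGP's record of next hops that the IGP has reported as failed.
failed_next_hops: List[str] = []

watch = RouteWatch()
# BGP registers its next hops (assumed: 10.0.0.1 = R4, 10.0.0.2 = R5)
# instead of polling them with a periodic scanner.
watch.register("10.0.0.1", failed_next_hops.append)
watch.register("10.0.0.2", failed_next_hops.append)

# The IGP detects that R4's address is gone; BGP hears about it immediately.
watch.igp_route_lost("10.0.0.1")
print(failed_next_hops)  # prints ['10.0.0.1']
```

The point of the sketch is the event-driven notification: BGP reacts when the IGP calls back, rather than discovering the failure at the next scanner run.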
So R1 learns about the R4 failure through the IGP. Then R1 has to go and delete all the BGP prefixes that were learned from that next hop. If it is a full Internet routing table, this is a very time-consuming process, as you can imagine. I am talking about minutes here.
In the absence of a precalculated backup path, BGP will rely on this control plane convergence, so of course it will take time. But you don't have to rely on that. I have recommended that many service providers consider BGP PIC and Egress FRR for their Internet and VPN services.
In the router's routing table, there is always a recursion for the BGP prefixes. So for the 5.5.5.5 prefix, the next hop would be 10.0.0.1 if next-hop-self is enabled.
But in order to forward the traffic, the router needs to resolve the immediate next hop and the Layer 2 encapsulation, for example the MAC address if the link is Ethernet.
For the BGP next hop 10.0.0.1, R1 selects either 172.16.0.1 or 172.16.1.1 as the IGP next hop. Or R1 can do ECMP (Equal Cost Multipath) and thus use both 172.16.0.1 and 172.16.1.1 to reach 10.0.0.1.
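The recursion described above can be sketched in a few lines of Python. This is a simplified model with plain dictionaries instead of a real longest-prefix-match table, and the names `bgp_rib`, `igp_rib`, and `resolve` are my own for the example.

```python
from typing import Dict, List

# BGP route: prefix -> BGP next hop (R4's address when next-hop-self is on).
bgp_rib: Dict[str, str] = {"5.5.5.5": "10.0.0.1"}

# IGP route: BGP next hop -> directly connected IGP next hops (ECMP here).
igp_rib: Dict[str, List[str]] = {"10.0.0.1": ["172.16.0.1", "172.16.1.1"]}


def resolve(prefix: str) -> List[str]:
    """Follow the recursion: prefix -> BGP next hop -> IGP next hop(s)."""
    bgp_next_hop = bgp_rib[prefix]
    return igp_rib[bgp_next_hop]


print(resolve("5.5.5.5"))  # prints ['172.16.0.1', '172.16.1.1']
```

The router cannot forward on 10.0.0.1 directly; it needs the second lookup to find an adjacent next hop it can actually send the frame to.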
In many vendors' FIB implementations, BGP prefixes resolve directly to the immediate IGP next hop. Cisco's CEF implementation works this way too. This is not necessarily a bad thing, though.
It provides better throughput since the router doesn't have to do a double/aggregate lookup. But from a fast convergence point of view, we need a hierarchical data plane (Hierarchical FIB).
With BGP PIC, in both the PIC Core and PIC Edge solutions, you have a hierarchical data plane, so for 5.5.5.5 you will have 10.0.0.1 or 10.0.0.2 as the next hop in the FIB (the same as in the RIB).
For 10.0.0.1 and 10.0.0.2, you will have another FIB entry that points to the IGP next hops, which are 172.16.0.1 and 172.16.1.1. These IGP next hops can be used in a load-sharing or an active/standby manner.
BGP PIC Core helps to hide IGP failures from the BGP process. If the R1-R2 or R2-R3 link fails, or if R2 or R3 itself fails, R1 will start to use the backup IGP next hop immediately. Since the BGP next hop didn't change and only the IGP path changed, recovery time is based on IGP convergence.
For BGP PIC Core, you don't need to have multiple IBGP next hops. BGP PIC Core can handle core IGP link and node failures.
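To show why the hierarchy matters for a core failure, here is a rough Python comparison of a flat and a hierarchical FIB. It is a toy model under my own assumptions: 1000 made-up prefixes stand in for the full table, and the function names are hypothetical. It only counts how many FIB entries must be touched when the IGP next hop 172.16.0.1 fails.

```python
from typing import Dict, List

PREFIX_COUNT = 1000  # small stand-in for the full table; an assumption for the demo
prefixes = [f"prefix-{i}" for i in range(PREFIX_COUNT)]

# Flat FIB: every BGP prefix stores the resolved IGP next hop directly.
flat_fib: Dict[str, str] = {p: "172.16.0.1" for p in prefixes}

# Hierarchical FIB: prefixes point at the BGP next hop; one shared entry
# maps that BGP next hop to its IGP next hops.
hier_bgp_level: Dict[str, str] = {p: "10.0.0.1" for p in prefixes}
hier_igp_level: Dict[str, List[str]] = {"10.0.0.1": ["172.16.0.1", "172.16.1.1"]}


def fail_igp_next_hop_flat(failed: str, backup: str) -> int:
    """Rewrite every flat entry that uses the failed IGP next hop."""
    rewritten = 0
    for prefix, next_hop in flat_fib.items():
        if next_hop == failed:
            flat_fib[prefix] = backup
            rewritten += 1
    return rewritten


def fail_igp_next_hop_hier(failed: str) -> int:
    """Remove the failed IGP next hop from the shared entries only."""
    rewritten = 0
    for igp_next_hops in hier_igp_level.values():
        if failed in igp_next_hops:
            igp_next_hops.remove(failed)
            rewritten += 1
    return rewritten


flat_updates = fail_igp_next_hop_flat("172.16.0.1", "172.16.1.1")
hier_updates = fail_igp_next_hop_hier("172.16.0.1")
print(flat_updates, hier_updates)  # prints 1000 1
```

In the flat model the work grows with the number of prefixes, while in the hierarchical model one shared entry is updated and the BGP level is not touched at all. That is the "prefix independent" part of the name.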
BGP PIC EDGE
Let me now explain BGP PIC Edge, which handles edge link or node failures and, in some scenarios, works in a slightly different way than BGP PIC Core.
In order for BGP PIC Edge to work, the edge IBGP devices (ingress PEs and ASBRs) need to support BGP PIC, and they also need to receive a backup BGP next hop.
Unfortunately, the backup next hop is not sent in IBGP Route Reflector topologies, because by default a Route Reflector advertises only its own best path. One of the drawbacks of a Route Reflector is that when it does hot potato routing by calculating the IGP cost to the BGP next hop, it takes only its own cost to the next hop into consideration. The IGP cost from the Route Reflector to the BGP next hops might be different from the IGP cost from an ingress PE to the same BGP next hops.
Thus, the Route Reflector may not provide an optimal path for all the ingress PEs. The BGP Optimal Route Reflection draft specifies a couple of solutions, which I covered in an earlier article.
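A small Python example shows how the Route Reflector's view can differ from the ingress PE's view. The IGP costs below are invented purely for illustration, and I am assuming that 10.0.0.1 and 10.0.0.2 are the next-hop addresses of R4 and R5.

```python
from typing import Dict

# Hypothetical IGP costs to the two BGP next hops (R4 = 10.0.0.1, R5 = 10.0.0.2).
cost_from_rr: Dict[str, int] = {"10.0.0.1": 10, "10.0.0.2": 30}
cost_from_ingress_pe: Dict[str, int] = {"10.0.0.1": 40, "10.0.0.2": 20}


def hot_potato(costs: Dict[str, int]) -> str:
    """Pick the BGP next hop with the lowest IGP cost from this router's view."""
    return min(costs, key=costs.__getitem__)


rr_choice = hot_potato(cost_from_rr)          # what the RR selects and reflects
pe_choice = hot_potato(cost_from_ingress_pe)  # what the PE would pick for itself
print(rr_choice, pe_choice)  # prints 10.0.0.1 10.0.0.2
```

With these numbers the RR reflects the path via 10.0.0.1, although the ingress PE is closer to 10.0.0.2, and the PE never even sees the second path.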
How would you send more than just the best path from the Route Reflector to the Route Reflector clients?
There are many ways to do it, but the two best-known ones are BGP Add-Path and BGP Diverse Paths (multiple control plane RRs). I will explain these ideas in a separate article.
Assume now that we have more than one path on R1.
We should cover two edge failure scenarios to show how BGP PIC Edge helps in different cases.
In the first case, we are doing BGP next-hop-self on R4, and R4 fails.
This failure is detected by the IGP, and Next-Hop Tracking removes the failed BGP next hop from the BGP path list on R1.
The alternate backup route can be used immediately. This is BGP data plane convergence, not control plane convergence, so the convergence time depends only on IGP convergence and is prefix independent. If you have a full Internet routing table of 900K prefixes, all of them are installed in the FIB with a backup route before the failure, and when the failure happens, the backup BGP next hop is used immediately.
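Here is a minimal Python sketch of that first PIC Edge case. It is a toy model under my own assumptions: `PathList` is a hypothetical name for the shared object that holds the primary and backup BGP next hops, 10.0.0.1 and 10.0.0.2 are assumed to be R4 and R5, and 1000 made-up prefixes stand in for the full table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PathList:
    """Shared object: ordered BGP next hops, primary first, backup after."""
    next_hops: List[str]

    def active(self) -> str:
        """The BGP next hop currently used for forwarding."""
        return self.next_hops[0]

    def remove(self, failed: str) -> None:
        """Drop a failed BGP next hop so that the backup becomes active."""
        self.next_hops = [nh for nh in self.next_hops if nh != failed]


# Primary via R4 (10.0.0.1), backup via R5 (10.0.0.2), installed before any failure.
shared = PathList(["10.0.0.1", "10.0.0.2"])

# Every prefix points at the same shared path list.
fib: Dict[str, PathList] = {f"prefix-{i}": shared for i in range(1000)}

before = fib["prefix-0"].active()
shared.remove("10.0.0.1")  # R4 fails: one update, regardless of prefix count
after = {path_list.active() for path_list in fib.values()}
print(before, after)  # prints 10.0.0.1 {'10.0.0.2'}
```

Because all prefixes share one path list, a single change moves every prefix to the backup next hop, with no per-prefix best path calculation in the failure path.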
BGP PIC is not strictly a BGP-only feature. It is a hierarchical data plane arrangement, and BGP can take advantage of it because BGP routes are resolved by recursion. It is also not Cisco proprietary; most vendors implement BGP PIC today.
The second failure scenario is the failure of the edge link between R4 and R6. R4 is our primary next hop, and we are doing next-hop-self on R4. (In MPLS VPN, you always do that!)
If the edge link fails, the BGP next hop doesn't change on R1, so R1 continues to forward the traffic to R4, following the IBGP best path advertised by the RR.
In this case, R4 should redirect the packet to its alternate, second-best path, which is via R5. But in an IP environment without tunneling, intermediate nodes that have not converged yet would send the packet back to R4, since they would still think that the prefix is reachable via R4, so there would be a temporary loop. With MPLS or another tunneling mechanism, intermediate nodes don't need BGP, so they would just forward the packets toward the second-best exit, as R4 requests.
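The difference between the two cases can be traced with a small Python simulation. The topology is my own assumption for the sketch (R2 sits between R4 and R5, so R4 reaches R5 via R2), and the table and function names are hypothetical. In the plain IP case each router looks up the destination prefix in its BGP-derived table; in the tunneled case each router only looks up the tunnel endpoint R5 in its IGP table.

```python
from typing import Dict, List

# Assumed topology for the sketch: R4 - R2 - R5.
# BGP-derived view: router -> {prefix: next router}.
bgp_view: Dict[str, Dict[str, str]] = {
    "R2": {"5.5.5.5": "R4"},  # not converged yet: still exits via R4
    "R4": {"5.5.5.5": "R2"},  # edge link down: redirect toward R5 via R2
}
# IGP view: router -> {destination router: next router}.
igp_view: Dict[str, Dict[str, str]] = {
    "R2": {"R4": "R4", "R5": "R5"},
    "R4": {"R5": "R2"},
}


def trace(start: str, prefix: str, exit_router: str, tunneled: bool,
          max_hops: int = 6) -> List[str]:
    """Trace a packet until it reaches exit_router, loops, or hits max_hops."""
    path = [start]
    node = start
    while node != exit_router and len(path) <= max_hops:
        if tunneled:
            # Intermediate routers forward on the tunnel endpoint only.
            node = igp_view[node][exit_router]
        else:
            # Intermediate routers look up the destination prefix themselves.
            node = bgp_view[node][prefix]
        path.append(node)
        if path.count(node) > 1:
            break  # a router was visited twice: forwarding loop
    return path


ip_path = trace("R4", "5.5.5.5", "R5", tunneled=False)
tunnel_path = trace("R4", "5.5.5.5", "R5", tunneled=True)
print(ip_path)      # prints ['R4', 'R2', 'R4']
print(tunnel_path)  # prints ['R4', 'R2', 'R5']
```

Without tunneling, the unconverged R2 bounces the packet back to R4. With the tunnel, R2 never consults its BGP view, so the packet reaches R5.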