BGP PIC Fundamentals: BGP PIC (Prefix Independent Convergence) is a BGP fast reroute mechanism which can provide sub-second convergence even for the 800K Internet prefixes by relying on IGP convergence.
BGP PIC uses a hierarchical data plane, in contrast to the flat FIB (forwarding table) design which is used by Cisco CEF and many legacy platforms.
In a hierarchical data plane, the FIB used by the packet processing engine reflects the recursions between the routes.
I will explain the recursion concept throughout the post, so don’t worry about the sentence above; it will make sense.
There are two implementations of the BGP PIC concept, and they can protect network traffic from multiple failures.
Link and node failures in the core or at the edge of the network can be recovered from in under a second, and in most cases in under 100 ms (this mostly depends on IGP convergence, so the IGP should be tuned or IGP FRR can be used).
In this article I will not explain IGP fast convergence or IGP fast reroute, but you can read my Fast Reroute mechanism article for that.
BGP PIC can be thought of as a BGP fast reroute mechanism which relies on IGP convergence for failure detection. (All overlay protocols rely on underlay protocol convergence, e.g. LDP/IGP synchronization, STP/HSRP, IGP/BGP, IP/GRE and so on.)
As I mentioned above, there are two implementations of BGP PIC, namely BGP PIC Edge and BGP PIC Core.
Let’s start with BGP PIC Core.
In the above figure, R1, R2, R3, R4 and R5 belong to AS 100, and R6 and R7 belong to AS 200.
There are two EBGP connections between ASBRs of the Service Providers.
Everybody has told you so far that BGP is slow because BGP is used for scalability in networks, not for fast convergence, right?
But that is wrong too. Or at least it is not enough to understand how BGP converges!
If BGP relies on the control plane to converge, of course it will be slow: the default timers are long (BGP MRAI, BGP Scanner and so on, although you don’t need to rely on them, as I will explain now), and there are too many prefixes and too much path information for the best path selection algorithm to quickly select a second best path to advertise when the primary path fails.
The default free zone already has more than 800K prefixes, so we are talking about approximately 100 MB of data from each neighbor, and transferring it takes time too. If you have multiple paths, the amount of data that needs to be sent will be much higher.
Let’s look at BGP control plane convergence more closely.
Imagine that R1 in the above picture learns the 184.108.40.206 prefix from R4 only. R4 is the next hop. (Maybe you chose R4 as the primary link with Local Preference or MED, or you didn’t do anything and R4 is selected by hot potato routing because of the route reflector position.)
If R4 is the best path, R5 doesn’t send the 220.127.116.11 prefix to the IBGP domain unless BGP best external is enabled (I highly recommend enabling it if you want additional path information in an active/standby link setup).
How do the IBGP routers learn that R4 failed?
There are two mechanisms for that. They either wait for the BGP Scanner timer (60 seconds in most implementations) to check whether the BGP next hops for the BGP prefixes are still up, or they use the newer approach, BGP Next Hop Tracking (almost all vendors support it). With BGP Next Hop Tracking, BGP next hop prefixes are registered with the IGP route watch process, so as soon as the IGP detects the BGP next hop failure, BGP is informed.
It is similar to BGP, IGP and LDP registering with BFD, right? Good!
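To make the registration idea concrete, here is a minimal Python sketch of next hop tracking. It is only a toy model: the class and method names (RouteWatch, BgpProcess, igp_lost_route and so on) are my own assumptions for illustration and do not correspond to any vendor's implementation.

```python
from collections import defaultdict


class RouteWatch:
    """Toy stand-in for the IGP route watch process (hypothetical API)."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def register(self, next_hop, callback):
        # A client (here BGP) asks to be told when this next hop changes.
        self._callbacks[next_hop].append(callback)

    def igp_lost_route(self, next_hop):
        # The IGP has just lost its route to next_hop: notify clients now,
        # instead of letting them find out on a periodic scan.
        for callback in self._callbacks.get(next_hop, []):
            callback(next_hop)


class BgpProcess:
    def __init__(self, route_watch):
        self._route_watch = route_watch
        self._tracked = set()
        self.unreachable_next_hops = set()

    def learn_path(self, prefix, next_hop):
        # Register each BGP next hop once, however many prefixes use it.
        if next_hop not in self._tracked:
            self._tracked.add(next_hop)
            self._route_watch.register(next_hop, self._on_next_hop_down)

    def _on_next_hop_down(self, next_hop):
        self.unreachable_next_hops.add(next_hop)


watch = RouteWatch()
bgp = BgpProcess(watch)
for i in range(1000):
    bgp.learn_path(f"prefix-{i}", "10.0.0.1")  # hypothetical prefix names

watch.igp_lost_route("10.0.0.1")
print(bgp.unreachable_next_hops)  # {'10.0.0.1'}
```

The point of the sketch is that BGP is told about the failure by an event from the IGP rather than by a timer, and that one registration covers every prefix behind that next hop.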
So R1 learned of the R4 failure through the IGP. Then R1 has to go and delete all the BGP prefixes which were learned from that next hop. If it is the full Internet routing table, this is a very time-consuming process, as you can imagine. I am talking about minutes here.
In the absence of an already calculated backup path, BGP will rely on this control plane convergence, so of course it will take time. But you don’t have to rely on that. I have recommended that many service providers start considering BGP PIC, Egress FRR for their Internet and VPN services.
In a router’s routing table there is always a recursion for BGP prefixes. So for the 18.104.22.168 prefix, the next hop would be 10.0.0.1 if next-hop-self is enabled.
But in order to forward the traffic, the router needs to resolve the immediate next hop and the layer 2 encapsulation; on Ethernet, that is the MAC address.
For the BGP next hop 10.0.0.1, R1 selects either 172.16.0.1 or 172.16.1.1 as the IGP next hop. Or R1 can do ECMP (Equal Cost Multipath) and thus use both 172.16.0.1 and 172.16.1.1 to reach 10.0.0.1.
In many vendors’ FIB implementations, BGP prefixes resolve directly to the immediate IGP next hop. Cisco’s CEF implementation works this way too. This is not necessarily a bad thing, though.
It provides better throughput since the router doesn’t have to do a double/aggregate lookup. But from the fast convergence point of view, we need a hierarchical data plane (hierarchical FIB).
With BGP PIC, in both the PIC Core and PIC Edge solutions, you will have a hierarchical data plane, so for 22.214.171.124 you will have 10.0.0.1 or 10.0.0.2 as the next hop in the FIB (same as in the RIB).
For 10.0.0.1 and 10.0.0.2 you will have another FIB entry which points to the IGP next hops, which are 172.16.0.1 and 172.16.1.1. These IGP next hops can be used in a load-shared or an active/standby manner.
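The difference between the two FIB designs can be sketched with plain Python dictionaries. This is only an illustration under simplifying assumptions: the prefixes are made-up documentation prefixes, lookups are exact match rather than longest prefix match, and the function names are hypothetical.

```python
# Hypothetical prefixes standing in for Internet routes (exact match only,
# no longest-prefix matching, to keep the sketch short).
PREFIXES = [f"198.51.{i}.0/24" for i in range(256)]

# Flat FIB: each prefix points straight at a pre-resolved IGP next hop.
flat_fib = {prefix: "172.16.0.1" for prefix in PREFIXES}

# Hierarchical FIB, level 1: prefix -> BGP next hop (same as the RIB).
bgp_fib = {prefix: "10.0.0.1" for prefix in PREFIXES}

# Hierarchical FIB, level 2: BGP next hop -> IGP next hops.
igp_fib = {
    "10.0.0.1": ["172.16.0.1", "172.16.1.1"],
    "10.0.0.2": ["172.16.0.1", "172.16.1.1"],
}


def flat_lookup(prefix):
    # One lookup, the recursion was resolved when the entry was installed.
    return flat_fib[prefix]


def hierarchical_lookup(prefix):
    # Two lookups, the recursion is preserved in the forwarding table.
    bgp_next_hop = bgp_fib[prefix]
    igp_next_hops = igp_fib[bgp_next_hop]
    # Active/standby use of the IGP next hops: take the first one.
    return bgp_next_hop, igp_next_hops[0]


print(flat_lookup(PREFIXES[0]))          # 172.16.0.1
print(hierarchical_lookup(PREFIXES[0]))  # ('10.0.0.1', '172.16.0.1')
```

Notice that the flat table carries 256 copies of the IGP next hop, while the hierarchical one stores it only once per BGP next hop. That shared second level is what the rest of the post builds on.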
BGP PIC Core helps to hide IGP failures from the BGP process. If the link between R1-R2 or R2-R3 fails, or if R2 or R3 fails, R1 will start to use the backup IGP next hop immediately. Since the BGP next hop didn’t change and only the IGP path changed, the recovery time will be based on IGP convergence.
For BGP PIC Core you don’t have to have multiple IBGP next hops. BGP PIC Core can handle core IGP link and node failures.
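Here is a rough Python sketch of why this recovery is prefix independent. It counts how many forwarding entries must be rewritten when the IGP next hop 172.16.0.1 fails, in a flat table versus a hierarchical one. The table layout, the function names and the 100,000 made-up prefixes are assumptions for illustration only.

```python
def build_tables(num_prefixes):
    prefixes = [f"prefix-{i}" for i in range(num_prefixes)]  # made-up names
    # Flat FIB: every prefix is pre-resolved to an IGP next hop.
    flat = {p: "172.16.0.1" for p in prefixes}
    # Hierarchical FIB: prefix -> BGP next hop -> IGP next hops.
    bgp_level = {p: "10.0.0.1" for p in prefixes}
    igp_level = {"10.0.0.1": ["172.16.0.1", "172.16.1.1"]}
    return flat, bgp_level, igp_level


def repair_flat(flat, failed, backup):
    # Every prefix that used the failed IGP next hop must be rewritten.
    writes = 0
    for prefix, next_hop in flat.items():
        if next_hop == failed:
            flat[prefix] = backup
            writes += 1
    return writes


def repair_hierarchical(igp_level, failed):
    # Only the shared IGP-level entries are touched; the per-prefix
    # entries still point at the same, unchanged BGP next hop.
    writes = 0
    for bgp_next_hop, igp_next_hops in igp_level.items():
        if failed in igp_next_hops:
            igp_level[bgp_next_hop] = [nh for nh in igp_next_hops if nh != failed]
            writes += 1
    return writes


flat, bgp_level, igp_level = build_tables(100_000)
flat_writes = repair_flat(flat, "172.16.0.1", "172.16.1.1")
hier_writes = repair_hierarchical(igp_level, "172.16.0.1")
print(flat_writes, hier_writes)  # 100000 1
```

In the hierarchical case the number of writes depends on the number of BGP next hops, not on the number of prefixes, which is exactly what "prefix independent" means.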
BGP PIC EDGE
Let me explain BGP PIC Edge, which can handle edge link or node failures and, for some scenarios, works slightly differently from BGP PIC Core.
In order for BGP PIC Edge to work, the edge IBGP devices (ingress PEs and ASBRs) need to support BGP PIC, and they also need to receive a backup BGP next hop.
Unfortunately, the backup next hop is not sent in IBGP route reflector topologies, since a route reflector advertises only its best path. One of the drawbacks of the route reflector is that when it needs to do hot potato routing by calculating the IGP cost to the BGP next hop, it takes only its own cost to the next hop into consideration. The route reflector’s IGP cost to the BGP next hops might be different from the ingress PE’s IGP cost to those next hops.
Thus the route reflector may not provide the optimal path for all the ingress PEs. The BGP Optimal Route Reflection draft specifies a couple of solutions, which I covered in an earlier article.
How would you send more than one path from the route reflector to the route reflector clients?
There are many ways to do it, but the two best known are BGP Add-Path and BGP Diverse Paths (multiple control plane RRs). I will explain these ideas in a separate article.
Assume now that we have more than one path on R1.
We should cover two edge failure scenarios to show how BGP PIC Edge helps in different cases.
In the first case, we are doing BGP next-hop-self on R4, and R4 fails.
This failure is detected by the IGP, and Next Hop Tracking removes the BGP next hop from the BGP path-list on R1.
The alternate backup route can be used immediately. This is BGP data plane convergence, not control plane convergence, so the convergence time depends only on IGP convergence and is prefix independent. If you have a full Internet routing table of 500K prefixes, all of them are installed in the FIB with a backup route before the failure, and when the failure happens, the next BGP next hop is used immediately.
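The first scenario can be sketched in Python as a path-list shared by all prefixes. This is a toy model and not any vendor's data structure: the PathList class, the next_hop_down function and the made-up prefix names are my assumptions, and I am assuming that 10.0.0.1 is R4 (primary) and 10.0.0.2 is R5 (backup).

```python
from dataclasses import dataclass, field


@dataclass
class PathList:
    """Shared by every prefix with the same primary/backup BGP next hops."""
    next_hops: list                          # ordered: primary first
    down: set = field(default_factory=set)   # next hops reported as failed

    def active(self):
        # First next hop in the list that has not been marked down.
        for next_hop in self.next_hops:
            if next_hop not in self.down:
                return next_hop
        return None


# Assumption: 10.0.0.1 is R4 (primary) and 10.0.0.2 is R5 (backup).
shared = PathList(["10.0.0.1", "10.0.0.2"])

# Every prefix points at the same path-list object, so the backup is
# already in the forwarding table before any failure happens.
fib = {f"prefix-{i}": shared for i in range(1000)}  # made-up prefix names


def next_hop_down(path_list, next_hop):
    # One update, driven by next hop tracking, repairs every prefix.
    path_list.down.add(next_hop)


before = fib["prefix-0"].active()
next_hop_down(shared, "10.0.0.1")   # R4 fails
after = {entry.active() for entry in fib.values()}
print(before, after)  # 10.0.0.1 {'10.0.0.2'}
```

A single change to the shared path-list moves all 1000 prefixes to the backup next hop, without walking the prefixes one by one.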
BGP PIC is not necessarily a BGP-only feature, since what BGP takes advantage of is recursion and the hierarchical data plane arrangement. It is also not a Cisco proprietary protocol; most vendors implement BGP PIC today.
The second failure scenario might be the edge link between R4 and R6. R4 is our primary next hop and we are doing next-hop-self on R4 (in MPLS VPN, you always do that!).
If the edge link fails, the BGP next hop doesn’t change on R1, so R1 continues to forward the traffic to R4 according to the IBGP best path sent by the RR.
In this case R4 should redirect the packet to its alternate, second best path, which is R5. But in an IP environment without tunneling, intermediate nodes which have not converged yet would send the packet back to R4, since they would still think the prefix is reachable through R4, so there would be a temporary loop. With MPLS or other tunneling mechanisms, intermediate nodes don’t need BGP, so they would just send the packet toward the second best path as R4 requested.