Re: BGP Path Selection weirdness regarding next hops

From: John Neiberger <jneiberger_at_gmail.com>
Date: Sat, 1 Dec 2012 08:17:21 -0700

Yes, that is exactly what happened! I had never run into that behavior. The
root cause of this weirdness for us was a linecard that had locked up. It
was no longer passing traffic, but it had not yet reloaded, so the upstream
neighbor timed out OSPF in 4 seconds but BGP stayed up for 180 seconds.
During that time period, traffic for some destinations was blackholed
because of the presence of a supernet route that was being used to validate
the reachability of the next hop in BGP.
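
To put a rough number on that exposure, here is a back-of-the-envelope sketch
in Python. The two timer values come from the paragraph above; the function
name is made up for illustration, and the result is an upper bound, since the
real window depends on when the last keepalive arrived.

```python
# Toy arithmetic only: the timer values are the ones quoted in this message.
OSPF_DETECT_SECONDS = 4    # upstream neighbor timed out OSPF in about 4 seconds
BGP_HOLD_SECONDS = 180     # BGP session stayed up until the hold timer expired


def stale_path_window(igp_detect: int, bgp_hold: int) -> int:
    """Seconds during which BGP may keep paths from a peer the IGP already dropped."""
    return max(0, bgp_hold - igp_detect)


window = stale_path_window(OSPF_DETECT_SECONDS, BGP_HOLD_SECONDS)
print(f"up to {window} seconds of exposure")  # up to 176 seconds of exposure
```

Anything that shortens BGP failure detection, or stops BGP from resolving the
dead next hop through a less specific route, shrinks that window.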

I had never heard of Next Hop Tracking until yesterday when someone from
Cisco said that a similar problem was solved in IOS XR using this feature,
but I don't recall if they mentioned that this feature was available in
IOS. Our production routers run XR and my home lab is IOS. I'll lab it up
using NHT and use that as a proof-of-concept to the other engineers
involved in this.

Thanks go to you and to Marko for helping me through this. We were pretty
confused by it, partially because none of us observed the problem while it
was happening. By the time network engineering was involved, the problem
had resolved. We just had to figure it out step-by-step in reverse, which
took a while.

Thanks again to both of you!
John

On Sat, Dec 1, 2012 at 12:13 AM, Brian Dennis <bdennis_at_ine.com> wrote:

> John,
> Let me see if I can sum this up:
>
> Two iBGP peers (let's call them Peer 1 and Peer 2) are advertising the
> same prefix via iBGP to Peer 3. From Peer 1 the NH (Next-Hop) metric is
> 10 to reach the prefix and from Peer 2 the NH metric is 20. The BGP
> decision process is selecting Peer 1 over Peer 2 due to the lower NH
> metric. You also have a default route learned via iBGP from another peer
> and let's say it has a NH metric of 5.
>
> Now Peer 2 goes down and within about 35 to 45 seconds the IGP converges
> and the NH is removed from Peer 3's RIB. Okay fine, so what, you might
> think as you're not using Peer 2 to reach that prefix anyways. The BGP
> peering session is still up due to the default BGP hold timers being
> 60/180 seconds in the IOS. So time wise we are about 60 seconds after the
> failure of Peer 2.
>
> After the default delay timer of 5 seconds the NHT (Next-Hop-Tracking) in
> BGP on Peer 3 kicks in (or BGP scan process in older IOS versions) to look
> and see if the NH to Peer 2 is still reachable via another route in the
> RIB. The only route available to reach the NH advertised by Peer 2 is the
> default route, so BGP updates the metric for Peer 2's NH to the metric of
> the default route (5). That is lower than the NH metric to Peer 1 (10),
> which is the current best path, so the prefix advertised by Peer 2 now
> becomes the best path. At this point we are about 66 seconds after
> the failure of Peer 2 but we still have 114 more seconds until BGP detects
> that Peer 2 is actually down so Peer 2's prefix is still useable by BGP.
> The real problem, as we know, is that Peer 2 is actually down but Peer 3
> will not detect it until the BGP hold time finally expires (180 seconds).
> Once Peer 2 is declared down, Peer 3 will switch back to Peer 1's prefix.
> The problem comes in that traffic is at best sub-optimally routed or at
> worst "black holed" when this occurs, as Peer 3 isn't using the "best"
> path (Peer 1).
>
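A rough sketch of the selection logic described above, written as a toy
Python model rather than anything IOS actually runs. The Route class and the
function names are invented for illustration, only the IGP-metric step of the
decision process is modeled, and the metrics are the ones from the lab output
further down (10 for R3, 20 for R5, 2 for the BGP default).

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str    # e.g. "10.3.3.3/32"
    protocol: str  # "ospf" or "bgp"
    metric: int    # metric BGP records for a next hop resolved via this route


def resolve(next_hop, rib):
    """Longest-prefix match of next_hop against the RIB; None if nothing covers it."""
    addr = ipaddress.ip_address(next_hop)
    covering = [r for r in rib if addr in ipaddress.ip_network(r.prefix)]
    return max(covering,
               key=lambda r: ipaddress.ip_network(r.prefix).prefixlen,
               default=None)


def best_next_hop(next_hops, rib):
    """Pick the resolvable next hop with the lowest resolved metric."""
    usable = []
    for nh in next_hops:
        route = resolve(nh, rib)
        if route is not None:
            usable.append((route.metric, nh))
    return min(usable)[1] if usable else None


PATHS = ["10.5.5.5", "10.3.3.3"]
DEFAULT = Route("0.0.0.0/0", "bgp", 2)
R3 = Route("10.3.3.3/32", "ospf", 10)
R5 = Route("10.5.5.5/32", "ospf", 20)

before = best_next_hop(PATHS, [DEFAULT, R3, R5])  # both loopbacks still in OSPF
after = best_next_hop(PATHS, [DEFAULT, R3])       # R5's /32 has aged out of OSPF
print(before, after)  # 10.3.3.3 10.5.5.5
```

Once the /32 for the failed peer disappears, its next hop falls through to the
default route, inherits that route's lower metric, and wins the comparison.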
> This is actually a common problem with a simple solution. Using BGP
> Selective NHT, just don't allow BGP to resolve the next hop via any route
> with a /31 or shorter mask (or whatever length you want) and/or via
> another BGP route.
>
>
> router bgp 10
>  bgp nexthop route-map RM_NH_FILTER
> !
>
> ip prefix-list PL_NH_FILTER seq 5 permit 0.0.0.0/0 le 31
> !
> route-map RM_NH_FILTER deny 10
>  match ip address prefix-list PL_NH_FILTER
> !
> route-map RM_NH_FILTER deny 20
>  match source-protocol bgp 10
> !
> route-map RM_NH_FILTER permit 30
>
>
> With this configuration Peer 3 will show the NH to Peer 2's prefix as
> "inaccessible" as it cannot use any route with a mask shorter than a /32
> and/or a route that's installed in the RIB via BGP. This can also be used
> to not allow BGP to use a discard route for the NH, which is another
> common problem.
>
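The effect of that route-map can be sketched with a small Python model. This
is a simplification under stated assumptions: the filter is applied to the
RIB route that covers the next hop, the function names are hypothetical, and
the RIB entries are (prefix, protocol, metric) tuples taken from the lab
below.

```python
import ipaddress


def allowed_for_nexthop(prefix: str, protocol: str) -> bool:
    """Mimic RM_NH_FILTER: deny masks /0 to /31, deny BGP routes, permit the rest."""
    if ipaddress.ip_network(prefix).prefixlen <= 31:  # deny 10: 0.0.0.0/0 le 31
        return False
    if protocol == "bgp":                             # deny 20: source-protocol bgp
        return False
    return True                                       # permit 30


def nexthop_state(next_hop: str, rib) -> str:
    """Return 'metric N' if the longest match passes the filter, else 'inaccessible'."""
    addr = ipaddress.ip_address(next_hop)
    covering = [r for r in rib if addr in ipaddress.ip_network(r[0])]
    if not covering:
        return "inaccessible"
    prefix, protocol, metric = max(
        covering, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)
    if allowed_for_nexthop(prefix, protocol):
        return f"metric {metric}"
    return "inaccessible"


# RIB on Peer 3 after the /32 for the failed peer has aged out of OSPF.
rib_after_failure = [("0.0.0.0/0", "bgp", 2), ("10.3.3.3/32", "ospf", 10)]
print(nexthop_state("10.5.5.5", rib_after_failure))  # inaccessible
print(nexthop_state("10.3.3.3", rib_after_failure))  # metric 10
```

With the filter in place the default route no longer counts as a valid way to
reach the failed peer's loopback, so the healthy path keeps winning.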
>
> I'll just do a quick blog post on this tomorrow. I labbed it all up
> already to verify what I talked about above but the post will have to wait
> until tomorrow. Wife is telling me it's 11pm and time to get off the
> computer as it's Friday night ;-) There are a few minor caveats to this
> that I'll mention this weekend in the blog post. Also I pasted my tests
> below.
>
> --
> Brian Dennis, CCIEx5 #2210 (R&S/ISP-Dial/Security/SP/Voice)
> bdennis_at_ine.com
>
> INE, Inc.
> http://www.INE.com
>
>
> ******************************************************
> R6 is our Peer 3 and R3 and R5 are Peer 1 and 2
> ******************************************************
>
> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
> BGP routing table entry for 50.0.0.0/8, version 3
> Paths: (2 available, best #2, table Default-IP-Routing-Table)
> Advertised to update-groups:
> 2
> 200, (Received from a RR-client)
> 10.5.5.5 (metric 20) from 10.5.5.5 (10.5.5.5)
> Origin incomplete, metric 0, localpref 100, valid, internal
> 200, (Received from a RR-client)
> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
> Origin incomplete, metric 0, localpref 100, valid, internal, best
> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
> BGP routing table entry for 0.0.0.0/0, version 2
> Paths: (1 available, best #1, table Default-IP-Routing-Table)
> Advertised to update-groups:
> 2
> Local, (Received from a RR-client)
> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
> Origin incomplete, metric 0, localpref 100, valid, internal, best
> Rack1R6#show ip route 10.3.3.3
> Routing entry for 10.3.3.3/32
> Known via "ospf 1", distance 110, metric 10, type intra area
> Last update from 173.1.36.3 on GigabitEthernet0/0.36, 00:17:05 ago
> Routing Descriptor Blocks:
> * 173.1.36.3, from 10.3.3.3, 00:17:05 ago, via GigabitEthernet0/0.36
> Route metric is 10, traffic share count is 1
>
> Rack1R6#show ip route 10.5.5.5
> Routing entry for 10.5.5.5/32
> Known via "ospf 1", distance 110, metric 20, type intra area
> Last update from 173.1.0.205 on GigabitEthernet0/0.456, 00:00:50 ago
> Routing Descriptor Blocks:
> * 173.1.0.205, from 10.5.5.5, 00:00:50 ago, via GigabitEthernet0/0.456
> Route metric is 20, traffic share count is 1
>
> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
> BGP routing table entry for 0.0.0.0/0, version 2
> Paths: (1 available, best #1, table Default-IP-Routing-Table)
> Advertised to update-groups:
> 2
> Local, (Received from a RR-client)
> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
> Origin incomplete, metric 0, localpref 100, valid, internal, best
> Rack1R6#
>
> ******************************************************
> Now let's shut down the interface R5 uses to reach R6.
> ******************************************************
>
> Rack1R5#conf t
> Enter configuration commands, one per line. End with CNTL/Z.
> Rack1R5(config)#int fa0/0.456
> Rack1R5(config-subif)#shut
> Rack1R5(config-subif)#^Z
> Rack1R5#
> %OSPF-5-ADJCHG: Process 1, Nbr 10.6.6.6 on FastEthernet0/0.456 from FULL
> to DOWN, Neighbor Down: Interface down or detached
> %OSPF-5-ADJCHG: Process 1, Nbr 10.10.10.10 on FastEthernet0/0.456 from
> FULL to DOWN, Neighbor Down: Interface down or detached
> %SYS-5-CONFIG_I: Configured from console by console
> Rack1R5#
>
> ******************************************************
> Now wait for OSPF to converge
> ******************************************************
>
> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
> BGP routing table entry for 50.0.0.0/8, version 4
> Paths: (2 available, best #1, table Default-IP-Routing-Table)
> Flag: 0x900
> Advertised to update-groups:
> 2
> 200, (Received from a RR-client)
> 10.5.5.5 (metric 2) from 10.5.5.5 (10.5.5.5)
> Origin incomplete, metric 0, localpref 100, valid, internal, best
> 200, (Received from a RR-client)
> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
> Origin incomplete, metric 0, localpref 100, valid, internal
> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
> BGP routing table entry for 0.0.0.0/0, version 2
> Paths: (1 available, best #1, table Default-IP-Routing-Table)
> Advertised to update-groups:
> 2
> Local, (Received from a RR-client)
> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
> Origin incomplete, metric 0, localpref 100, valid, internal, best
> Rack1R6#
>
> ******************************************************
> Now roughly 180 seconds later
> ******************************************************
>
> Rack1R6#
> %BGP-3-NOTIFICATION: received from neighbor 10.5.5.5 4/0 (hold time
> expired) 0 bytes
> Rack1R6#
> %BGP-5-ADJCHANGE: neighbor 10.5.5.5 Down BGP Notification received
> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
> BGP routing table entry for 50.0.0.0/8, version 5
> Paths: (1 available, best #1, table Default-IP-Routing-Table)
> Flag: 0x900
> Advertised to update-groups:
> 2
> 200, (Received from a RR-client)
> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
> Origin incomplete, metric 0, localpref 100, valid, internal, best
> Rack1R6#
>
> ******************************************************
> But with the Selective NHT config this is the result
> ******************************************************
>
> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
> BGP routing table entry for 50.0.0.0/8, version 5
> Paths: (2 available, best #2, table Default-IP-Routing-Table)
> Advertised to update-groups:
> 2
> 200, (Received from a RR-client)
> 10.5.5.5 (inaccessible) from 10.5.5.5 (10.5.5.5)
> Origin incomplete, metric 0, localpref 100, valid, internal
> 200, (Received from a RR-client)
> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
> Origin incomplete, metric 0, localpref 100, valid, internal, best
> Rack1R6#
>
>
> On 11/30/12 6:21 PM, "John Neiberger" <jneiberger_at_gmail.com> wrote:
>
> >I posted this question to the Cisco NSP list and I've also talked to a
> >couple of guys from Cisco Advanced Services and I'm still stumped about
> >something. I'll try my best to phrase it in a way that makes sense.
> >
> >Router A is learning about a prefix from two route reflector clients. In
> >both cases, the next hop for the prefix is the loopback address of the
> >advertising routers. Their loopback addresses are being advertised into
> >OSPF.
> >
> >So, from the perspective of Router A, its BGP table for this prefix has
> >two paths:
> >
> >1. 4.4.4.4 (loopback address of Router B, learned via OSPF) * winner due
> >to lower IGP metric
> >2. 5.5.5.5 (loopback address of Router C, learned via OSPF)
> >
> >Now for the weirdness to begin. A network event occurs that causes the
> >loopback address of Router C to go away. This shouldn't affect Router A
> >because it is already selecting the shortest path to the network via
> >Router B (4.4.4.4).
> >
> >However, Router A is also learning a default via BGP. That means that even
> >though 5.5.5.5 (loopback of Router C) disappeared and is unreachable, the
> >router is doing a recursive lookup and keeps the path in the BGP table;
> >5.5.5.5 is still reachable, it thinks, by using the default route.
> >
> >The weird thing is that this causes Router A to start using the wrong
> >path! It seems to be preferring a path with a next hop learned via BGP to
> >a path with a next hop learned via OSPF. Why would it do this? I see no
> >documentation that would explain why a BGP-learned next hop is preferred
> >over an IGP-learned next hop.
> >
> >Is the router still comparing IGP metrics even though the "wrong" path now
> >has no IGP metric?
> >
> >It's not changing due to router ID, cluster length, or neighbor IP
> >address. I checked. So, why is it switching?
> >
> >As soon as the BGP session from Router A to Router C times out, the
> >extraneous path gets removed from the BGP table and the router goes back
> >to using the correct path it should have been using all along.
> >
> >So, is a BGP-learned next hop preferred over an IGP-learned next hop? If
> >so, why? If not, any idea why my router switches paths? I've turned on BGP
> >debugging and IP routing debugging and haven't found a suitable
> >explanation for the switch.
> >
> >John
> >
> >
> >Blogs and organic groups at http://www.ccie.net
> >
> >_______________________________________________________________________
> >Subscription information may be found at:
> >http://www.groupstudy.com/list/CCIELab.html

Received on Sat Dec 01 2012 - 08:17:21 ART
