Re: BGP Path Selection weirdness regarding next hops

From: Brian McGahan <bmcgahan_at_ine.com>
Date: Sat, 1 Dec 2012 09:36:02 -0600

What XR version are you running?

Brian McGahan, CCIE #8593 (R&S/SP/Security)
bmcgahan_at_INE.com

On Dec 1, 2012, at 9:18 AM, "John Neiberger" <jneiberger_at_gmail.com> wrote:

> Yes, that is exactly what happened! I had never run into that behavior. The
> root cause of this weirdness for us was a linecard that had locked up. It
> was no longer passing traffic, but it had not yet reloaded, so the upstream
> neighbor timed out OSPF in 4 seconds but BGP stayed up for 180 seconds.
> During that time period, traffic for some destinations was blackholed
> because of the presence of a supernet route that was being used to validate
> the reachability of the next hop in BGP.
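
A rough back-of-the-envelope sketch of that exposure window, in Python rather than anything taken from the routers. The 4-second OSPF timeout and the 180-second BGP hold time are the figures quoted above. The 5-second next-hop tracking delay is an assumption borrowed from the IOS default mentioned further down the thread, and `blackhole_window` is a hypothetical helper name. It also assumes the hold timer effectively starts at the moment of failure, which can overstate the window by up to one keepalive interval.

```python
def blackhole_window(hold_time_s, igp_detect_s, nht_delay_s):
    """Seconds during which BGP may keep using a next hop the IGP has withdrawn.

    Simplified model: the exposure starts once the IGP has removed the route
    and next-hop tracking has re-resolved the next hop, and it ends when the
    BGP hold timer expires.
    """
    exposure_start = igp_detect_s + nht_delay_s
    return max(0, hold_time_s - exposure_start)


# Figures from the message above, plus the assumed 5-second NHT delay.
window = blackhole_window(hold_time_s=180, igp_detect_s=4, nht_delay_s=5)
print(window)  # 171
```

Shortening the BGP hold time or the failure detection on the session shrinks this number directly, which is why the slow hold timer matters more here than the fast IGP.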
>
> I had never heard of Next Hop Tracking until yesterday when someone from
> Cisco said that a similar problem was solved in IOS XR using this feature,
> but I don't recall if they mentioned that this feature was available in
> IOS. Our production routers run XR and my home lab is IOS. I'll lab it up
> using NHT and use that as a proof-of-concept to the other engineers
> involved in this.
>
> Thanks go to you and to Marko for helping me through this. We were pretty
> confused by it, partially because none of us observed the problem while it
> was happening. By the time network engineering was involved, the problem
> had resolved. We just had to figure it out step-by-step in reverse, which
> took a while.
>
> Thanks again to both of you!
> John
>
>
> On Sat, Dec 1, 2012 at 12:13 AM, Brian Dennis <bdennis_at_ine.com> wrote:
>
>> John,
>> Let me see if I can sum this up:
>>
>> Two iBGP peers (let's call them Peer 1 and Peer 2) are advertising the
>> same prefix via iBGP to Peer 3. From Peer 1 the NH (Next-Hop) metric is
>> 10 to reach the prefix and from Peer 2 the NH metric is 20. The BGP
>> decision process is selecting Peer 1 over Peer 2 due to the lower NH
>> metric. You also have a default route learned via iBGP from another peer
>> and let's say it has a NH metric of 5.
>>
>> Now Peer 2 goes down and within about 35 to 45 seconds the IGP converges
>> and the NH is removed from Peer 3's RIB. Okay fine, so what, you might
>> think, as you're not using Peer 2 to reach that prefix anyway. The BGP
>> peering session is still up due to the default BGP keepalive/hold timers
>> being 60/180 seconds in IOS. So time-wise we are about 60 seconds after
>> the failure of Peer 2.
>>
>> After the default delay timer of 5 seconds the NHT (Next-Hop Tracking) in
>> BGP on Peer 3 kicks in (or the BGP scan process in older IOS versions) to look
>> and see if the NH to Peer 2 is still reachable via another route in the
>> RIB. The only route available to reach the NH advertised by Peer 2 is the
>> default route. The default route's NH metric is less than the NH metric
>> to Peer 1 which is the current best path. This means that BGP updates the
>> NH metric to Peer 2's NH with the IGP metric to reach the default route
>> (5). Now the prefix advertised by Peer 2 becomes the best path since it
>> now has a lower NH metric. At this point we are about 66 seconds after
>> the failure of Peer 2 but we still have 114 more seconds until BGP detects
>> that Peer 2 is actually down so Peer 2's prefix is still useable by BGP.
>> The real problem, as we know, is that Peer 2 is actually down but Peer 3
>> will not detect it until the BGP hold time finally expires (180 seconds).
>> Once Peer 2 is declared down, Peer 3 will switch back to Peer 1's prefix.
>> The problem is that traffic is at best sub-optimally routed, or at worst
>> "black holed", while this is happening, as Peer 3 isn't using the "best"
>> path (Peer 1).
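
The sequence above can be sketched as a toy model in Python. This only illustrates recursive next-hop resolution plus the IGP-metric tie-breaker, not how IOS implements it. `Route`, `resolve`, and `best_path` are hypothetical names, all earlier best-path attributes are assumed to be equal, and the default route carries metric 2 as a stand-in for the IGP metric to its own next hop (the value that shows up in the lab output further down).

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: ipaddress.IPv4Network
    protocol: str
    metric: int  # metric BGP would record for a next hop resolved via this route


def resolve(rib, next_hop):
    """Longest-prefix match of a BGP next hop against the RIB (None if no route)."""
    addr = ipaddress.ip_address(next_hop)
    covering = [r for r in rib if addr in r.prefix]
    return max(covering, key=lambda r: r.prefix.prefixlen, default=None)


def best_path(rib, next_hops):
    """Pick the next hop with the lowest resolved metric (all else assumed equal)."""
    metrics = {}
    for nh in next_hops:
        route = resolve(rib, nh)
        if route is not None:
            metrics[nh] = route.metric
    return min(metrics, key=metrics.get) if metrics else None


net = ipaddress.ip_network
rib_before = [
    Route(net("10.3.3.3/32"), "ospf", 10),  # Peer 1 loopback
    Route(net("10.5.5.5/32"), "ospf", 20),  # Peer 2 loopback
    Route(net("0.0.0.0/0"), "bgp", 2),      # iBGP-learned default
]
# After the IGP converges, Peer 2's /32 is gone but the default still covers it.
rib_after = [r for r in rib_before if r.prefix != net("10.5.5.5/32")]

print(best_path(rib_before, ["10.5.5.5", "10.3.3.3"]))  # 10.3.3.3
print(best_path(rib_after, ["10.5.5.5", "10.3.3.3"]))   # 10.5.5.5
```

The second result is the flip described above: the dead peer's next hop now resolves through the default route, inherits its lower metric, and wins.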
>>
>> This is actually a common problem with a simple solution. Just don't
>> allow BGP to resolve the next hop via any route that is a /31 or shorter
>> (or whatever length you want) and/or via another BGP route, using BGP
>> Selective NHT.
>>
>>
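>> router bgp 10
>>  ! Apply the route-map to the RIB route that resolves each BGP next hop
>>  bgp nexthop route-map RM_NH_FILTER
>> !
>> ! Match every prefix with a mask of /0 through /31
>> ip prefix-list PL_NH_FILTER seq 5 permit 0.0.0.0/0 le 31
>> !
>> ! Reject next-hop resolution via anything shorter than a /32
>> route-map RM_NH_FILTER deny 10
>>  match ip address prefix-list PL_NH_FILTER
>> !
>> ! Reject next-hop resolution via a route learned from BGP AS 10
>> route-map RM_NH_FILTER deny 20
>>  match source-protocol bgp 10
>> !
>> ! Allow everything else
>> route-map RM_NH_FILTER permit 30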
>>
>>
>> With this configuration Peer 3 will show the NH to Peer 2's prefix as
>> "inaccessible" as it cannot use any route with a mask shorter than a /32
>> and/or a route that's installed in the RIB via BGP. This can also be used
>> to not allow BGP to use a discard route for the NH which is another common
>> problem.
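
The effect of that route-map can be sketched in Python as a simplified model. It assumes the route-map is evaluated against the longest-match RIB route covering the next hop and that a deny leaves the next hop inaccessible. `nh_filter_permits` and `resolve_filtered` are hypothetical names, and the RIB is reduced to (prefix, protocol) pairs.

```python
import ipaddress


def nh_filter_permits(prefix, protocol):
    """Mirror the three route-map entries: deny 10, deny 20, permit 30."""
    if ipaddress.ip_network(prefix).prefixlen <= 31:
        return False  # deny 10: prefix-list 0.0.0.0/0 le 31
    if protocol == "bgp":
        return False  # deny 20: source-protocol bgp
    return True       # permit 30: everything else


def resolve_filtered(rib, next_hop):
    """Return the covering route if the filter permits it, else None (inaccessible)."""
    addr = ipaddress.ip_address(next_hop)
    covering = [r for r in rib if addr in ipaddress.ip_network(r[0])]
    if not covering:
        return None
    longest = max(covering, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)
    return longest if nh_filter_permits(*longest) else None


# RIB on Peer 3 after Peer 2's /32 has been withdrawn by the IGP.
rib_after = [("10.3.3.3/32", "ospf"), ("0.0.0.0/0", "bgp")]

print(resolve_filtered(rib_after, "10.3.3.3"))  # ('10.3.3.3/32', 'ospf')
print(resolve_filtered(rib_after, "10.5.5.5"))  # None
```

With the filter in place the dead peer's next hop no longer resolves through the default route, so its path drops out of the comparison instead of winning it.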
>>
>>
>> I'll just do a quick blog post on this tomorrow. I labbed it all up
>> already to verify what I talked about above but the post will have to wait
>> until tomorrow. Wife is telling me it's 11pm and time to get off the
>> computer as it's Friday night ;-) There are a few minor caveats to this
>> that I'll mention this weekend in the blog post. Also I pasted my tests
>> below.
>>
>> --
>> Brian Dennis, CCIEx5 #2210 (R&S/ISP-Dial/Security/SP/Voice)
>> bdennis_at_ine.com
>>
>> INE, Inc.
>> http://www.INE.com
>>
>>
>>
>>
>> ******************************************************
>> R6 is our Peer 3 and R3 and R5 are Peer 1 and 2
>> ******************************************************
>>
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 3
>> Paths: (2 available, best #2, table Default-IP-Routing-Table)
>> Advertised to update-groups:
>> 2
>> 200, (Received from a RR-client)
>> 10.5.5.5 (metric 20) from 10.5.5.5 (10.5.5.5)
>> Origin incomplete, metric 0, localpref 100, valid, internal
>> 200, (Received from a RR-client)
>> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> BGP routing table entry for 0.0.0.0/0, version 2
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>> Advertised to update-groups:
>> 2
>> Local, (Received from a RR-client)
>> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#show ip route 10.3.3.3
>> Routing entry for 10.3.3.3/32
>> Known via "ospf 1", distance 110, metric 10, type intra area
>> Last update from 173.1.36.3 on GigabitEthernet0/0.36, 00:17:05 ago
>> Routing Descriptor Blocks:
>> * 173.1.36.3, from 10.3.3.3, 00:17:05 ago, via GigabitEthernet0/0.36
>> Route metric is 10, traffic share count is 1
>>
>> Rack1R6#show ip route 10.5.5.5
>> Routing entry for 10.5.5.5/32
>> Known via "ospf 1", distance 110, metric 20, type intra area
>> Last update from 173.1.0.205 on GigabitEthernet0/0.456, 00:00:50 ago
>> Routing Descriptor Blocks:
>> * 173.1.0.205, from 10.5.5.5, 00:00:50 ago, via GigabitEthernet0/0.456
>> Route metric is 20, traffic share count is 1
>>
>> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> BGP routing table entry for 0.0.0.0/0, version 2
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>> Advertised to update-groups:
>> 2
>> Local, (Received from a RR-client)
>> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
>>
>> ******************************************************
>> Now let's shut down the interface R5 uses to reach R6.
>> ******************************************************
>>
>> Rack1R5#conf t
>> Enter configuration commands, one per line. End with CNTL/Z.
>> Rack1R5(config)#int fa0/0.456
>> Rack1R5(config-subif)#shut
>> Rack1R5(config-subif)#^Z
>> Rack1R5#
>> %OSPF-5-ADJCHG: Process 1, Nbr 10.6.6.6 on FastEthernet0/0.456 from FULL
>> to DOWN, Neighbor Down: Interface down or detached
>> %OSPF-5-ADJCHG: Process 1, Nbr 10.10.10.10 on FastEthernet0/0.456 from
>> FULL to DOWN, Neighbor Down: Interface down or detached
>> %SYS-5-CONFIG_I: Configured from console by console
>> Rack1R5#
>>
>> ******************************************************
>> Now wait for OSPF to converge
>> ******************************************************
>>
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 4
>> Paths: (2 available, best #1, table Default-IP-Routing-Table)
>> Flag: 0x900
>> Advertised to update-groups:
>> 2
>> 200, (Received from a RR-client)
>> 10.5.5.5 (metric 2) from 10.5.5.5 (10.5.5.5)
>> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> 200, (Received from a RR-client)
>> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>> Origin incomplete, metric 0, localpref 100, valid, internal
>> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> BGP routing table entry for 0.0.0.0/0, version 2
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>> Advertised to update-groups:
>> 2
>> Local, (Received from a RR-client)
>> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
>>
>> ******************************************************
>> Now roughly 180 seconds later
>> ******************************************************
>>
>> Rack1R6#
>> %BGP-3-NOTIFICATION: received from neighbor 10.5.5.5 4/0 (hold time
>> expired) 0 bytes
>> Rack1R6#
>> %BGP-5-ADJCHANGE: neighbor 10.5.5.5 Down BGP Notification received
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 5
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>> Flag: 0x900
>> Advertised to update-groups:
>> 2
>> 200, (Received from a RR-client)
>> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
>>
>> ******************************************************
>> But with the Selective NHT config this is the result
>> ******************************************************
>>
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 5
>> Paths: (2 available, best #2, table Default-IP-Routing-Table)
>> Advertised to update-groups:
>> 2
>> 200, (Received from a RR-client)
>> 10.5.5.5 (inaccessible) from 10.5.5.5 (10.5.5.5)
>> Origin incomplete, metric 0, localpref 100, valid, internal
>> 200, (Received from a RR-client)
>> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
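
For anyone repeating this in a lab, the path lines in the captures above can be compared mechanically. The following is a minimal Python sketch, assuming the output layout shown in this thread. `parse_paths` and `PATH_RE` are hypothetical names, and it does not handle an attribute line whose trailing "best" has wrapped onto the next line.

```python
import re

# Matches e.g. "10.5.5.5 (metric 2) from ..." or "10.5.5.5 (inaccessible) from ..."
PATH_RE = re.compile(r"(\d+\.\d+\.\d+\.\d+) \((?:metric (\d+)|inaccessible)\) from ")


def parse_paths(output):
    """Extract next hop, next-hop metric (None if inaccessible) and best flag."""
    paths = []
    for line in output.splitlines():
        match = PATH_RE.search(line)
        if match:
            metric = int(match.group(2)) if match.group(2) else None
            paths.append({"next_hop": match.group(1), "metric": metric, "best": False})
        elif paths and "Origin" in line:
            # The attribute line following a path ends in "best" for the winner.
            paths[-1]["best"] = line.rstrip().endswith("best")
    return paths


SAMPLE = """\
200, (Received from a RR-client)
10.5.5.5 (metric 2) from 10.5.5.5 (10.5.5.5)
Origin incomplete, metric 0, localpref 100, valid, internal, best
200, (Received from a RR-client)
10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
Origin incomplete, metric 0, localpref 100, valid, internal
"""

for path in parse_paths(SAMPLE):
    print(path)
```

Running it against captures taken before and after the failure makes the metric change on 10.5.5.5 (20, then 2, then inaccessible with the filter) easy to spot.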
>>
>>
>>
>>
>>
>>
>> On 11/30/12 6:21 PM, "John Neiberger" <jneiberger_at_gmail.com> wrote:
>>
>>> I posted this question to the Cisco NSP list and I've also talked to a
>>> couple of guys from Cisco Advanced Services and I'm still stumped about
>>> something. I'll try my best to phrase it in a way that makes sense.
>>>
>>> Router A is learning about a prefix from two route reflector clients. In
>>> both cases, the next hop for the prefix is the loopback address of the
>>> advertising routers. Their loopback addresses are being advertised into
>>> OSPF.
>>>
>>> So, from the perspective of Router A, its BGP table for this prefix has
>>> two paths:
>>>
>>> 1. 4.4.4.4 (loopback address of Router B, learned via OSPF) * winner due
>>>    to lower IGP metric
>>> 2. 5.5.5.5 (loopback address of Router C, learned via OSPF)
>>>
>>> Now for the weirdness to begin. A network event occurs that causes the
>>> loopback address of Router C to go away. This shouldn't affect Router A
>>> because it is already selecting the shortest path to the network via
>>> Router B (4.4.4.4).
>>>
>>> However, Router A is also learning a default via BGP. That means that even
>>> though 5.5.5.5 (loopback of Router C) disappeared and is unreachable, the
>>> router is doing a recursive lookup and keeps the path in the BGP table;
>>> 5.5.5.5 is still reachable, it thinks, by using the default route.
>>>
>>> The weird thing is that this causes Router A to start using the wrong
>>> path! It seems to be preferring a path with a next hop learned via BGP
>>> to a path with a next hop learned via OSPF. Why would it do this? I see
>>> no documentation that would explain why a BGP-learned next hop is
>>> preferred over an IGP-learned next hop.
>>>
>>> Is the router still comparing IGP metrics even though the "wrong" path now
>>> has no IGP metric?
>>>
>>> It's not changing due to router ID, cluster length, or neighbor IP
>>> address. I checked. So, why is it switching?
>>>
>>> As soon as the BGP session from Router A to Router C times out, the
>>> extraneous path gets removed from the BGP table and the router goes back
>>> to using the correct path it should have been using all along.
>>>
>>> So, is a BGP-learned next hop preferred over an IGP-learned next hop? If
>>> so, why? If not, any idea why my router switches paths? I've turned on BGP
>>> debugging and IP routing debugging and haven't found a suitable
>>> explanation for the switch.
>>>
>>> John
>>>
>>>
>>> Blogs and organic groups at http://www.ccie.net
>>>
>>> _______________________________________________________________________
>>> Subscription information may be found at:
>>> http://www.groupstudy.com/list/CCIELab.html
>
>

Received on Sat Dec 01 2012 - 09:36:02 ART

This archive was generated by hypermail 2.2.0 : Tue Jan 01 2013 - 09:36:53 ART