Re: BGP Path Selection weirdness regarding next hops from Marko Milivojevic on 2012-12-01 (Ccielab archives 12/2012)

From: Marko Milivojevic <markom_at_ipexpert.com>
Date: Sat, 1 Dec 2012 08:41:34 -0800
I was hoping you'd post the findings of our webex session last night - didn't want to steal the thunder :-)
Personally, I think we've uncovered a great many little things that conspired to cause this :-)
--
Marko Milivojevic - CCIE #18427 (SP R&S)
Senior CCIE Instructor - IPexpert
:: This message was sent from a mobile device. I apologize for errors and brevity. ::
On Dec 1, 2012, at 7:17, John Neiberger <jneiberger_at_gmail.com> wrote:
> Yes, that is exactly what happened! I had never run into that behavior. The
> root cause of this weirdness for us was a linecard that had locked up. It
> was no longer passing traffic, but it had not yet reloaded, so the upstream
> neighbor timed out OSPF in 4 seconds but BGP stayed up for 180 seconds.
> During that time period, traffic for some destinations was blackholed
> because of the presence of a supernet route that was being used to validate
> the reachability of the next hop in BGP.
> 
> I had never heard of Next Hop Tracking until yesterday when someone from
> Cisco said that a similar problem was solved in IOS XR using this feature,
> but I don't recall if they mentioned that this feature was available in
> IOS. Our production routers run XR and my home lab is IOS. I'll lab it up
> using NHT and use that as a proof-of-concept to the other engineers
> involved in this.
> 
> Thanks go to you and to Marko for helping me through this. We were pretty
> confused by it, partially because none of us observed the problem while it
> was happening. By the time network engineering was involved, the problem
> had resolved. We just had to figure it out step-by-step in reverse, which
> took a while.
> 
> Thanks again to both of you!
> John
> 
> 
> On Sat, Dec 1, 2012 at 12:13 AM, Brian Dennis <bdennis_at_ine.com> wrote:
> 
>> John,
>> Let me see if I can sum this up:
>> 
>> Two iBGP peers (let's call them Peer 1 and Peer 2) are advertising the
>> same prefix via iBGP to Peer 3.  From Peer 1 the NH (Next-Hop) metric is
>> 10 to reach the prefix and from Peer 2 the NH metric is 20.  The BGP
>> decision process is selecting Peer 1 over Peer 2 due to the lower NH
>> metric.  You also have a default route learned via iBGP from another peer
>> and let's say it has a NH metric of 5.
>> 
>> Now Peer 2 goes down and within about 35 to 45 seconds the IGP converges
>> and the NH is removed from Peer 3's RIB.  Okay fine, so what, you might
>> think as you're not using Peer 2 to reach that prefix anyways.  The BGP
>> peering session is still up due to the default BGP hold timers being
>> 60/180 seconds in the IOS.  So time wise we are about 60 seconds after the
>> failure of Peer 2.
>> 
>> After the default delay timer of 5 seconds the NHT (Next-Hop Tracking) in
>> BGP on Peer 3 kicks in (or BGP scan process in older IOS versions) to look
>> and see if the NH to Peer 2 is still reachable via another route in the
>> RIB.  The only route available to reach the NH advertised by Peer 2 is the
>> default route.  The default route's NH metric is less than the NH metric
>> to Peer 1 which is the current best path.  This means that BGP updates the
>> NH metric to Peer 2's NH with the IGP metric to reach the default route
>> (5).  Now the prefix advertised by Peer 2 becomes the best path since it
>> now has a lower NH metric.  At this point we are about 66 seconds after
>> the failure of Peer 2 but we still have 114 more seconds until BGP detects
>> that Peer 2 is actually down so Peer 2's prefix is still useable by BGP.
>> The real problem, as we know, is that Peer 2 is actually down but Peer 3
>> will not detect it until the BGP hold time finally expires (180 seconds).
>> Once Peer 2 is declared down, Peer 3 will switch back to Peer 1's prefix.
>> The problem is that traffic is at best sub-optimally routed or at worst
>> "black holed" when this occurs, as Peer 3 isn't using the "best" path
>> (Peer 1).
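The sequence described above can be sketched in a few lines of Python. This is only an illustrative model, not how IOS implements it: it assumes every earlier step of the BGP decision process is tied, so the lowest metric to the next hop decides, and it resolves each next hop with a plain longest-prefix match. The metrics (10, 20, 5) are the ones used in the explanation, the addresses are the lab loopbacks, and the function names are made up for this sketch.

```python
import ipaddress

def resolve_next_hop(rib, next_hop):
    """Longest-prefix match of next_hop in rib; returns (network, metric) or None."""
    nh = ipaddress.ip_address(next_hop)
    matches = [(net, metric) for net, metric in rib.items() if nh in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)

def best_path(rib, next_hops):
    """Pick the next hop with the lowest resolved metric (all other attributes assumed equal)."""
    candidates = []
    for nh in next_hops:
        hit = resolve_next_hop(rib, nh)
        if hit is not None:
            candidates.append((hit[1], nh))
    return min(candidates)[1] if candidates else None

# Hypothetical RIB on Peer 3: two OSPF host routes plus the iBGP-learned default.
rib = {
    ipaddress.ip_network("10.3.3.3/32"): 10,  # Peer 1 loopback
    ipaddress.ip_network("10.5.5.5/32"): 20,  # Peer 2 loopback
    ipaddress.ip_network("0.0.0.0/0"): 5,     # default route
}
paths = ["10.3.3.3", "10.5.5.5"]

before = best_path(rib, paths)

# Peer 2 fails: the IGP withdraws its /32, but the BGP session is still up,
# so the path stays in the table and its next hop now resolves via the default.
del rib[ipaddress.ip_network("10.5.5.5/32")]
after = best_path(rib, paths)

print(before, after)
```

In this model the best path moves from 10.3.3.3 to 10.5.5.5 as soon as the /32 disappears, because the default route's metric (5) is lower than the metric to Peer 1 (10).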
>> 
>> This is actually a common problem with a simple solution.  Just use BGP
>> Selective NHT so that BGP can't resolve the next hop through any route
>> with a /31 or shorter mask (or whatever length you want) and/or through
>> another BGP route.
>> 
>> 
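>> router bgp 10
>> ! Apply the route-map to routes considered for next-hop resolution
>> bgp nexthop route-map RM_NH_FILTER
>> !
>> 
>> ! Matches any prefix with a mask of /31 or shorter
>> ip prefix-list PL_NH_FILTER seq 5 permit 0.0.0.0/0 le 31
>> !
>> ! Don't resolve the next hop through anything less specific than a /32
>> route-map RM_NH_FILTER deny 10
>> match ip address prefix-list PL_NH_FILTER
>> !
>> ! Don't resolve the next hop through a BGP-learned route
>> route-map RM_NH_FILTER deny 20
>> match source-protocol bgp 10
>> !
>> ! Anything else may be used
>> route-map RM_NH_FILTER permit 30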
>> 
>> 
>> With this configuration Peer 3 will show the NH to Peer 2's prefix as
>> "inaccessible" as it cannot use any route with a mask shorter than a /32
>> and/or a route that's installed in the RIB via BGP. This can also be used
>> to not allow BGP to use a discard route for the NH which is another common
>> problem.
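To make the filter logic concrete, here is a small Python model of the three route-map clauses. It is a sketch under assumptions: it assumes the route-map is evaluated only against the longest-match route covering the next hop, with no fallback to a less specific route if that one is denied, and the function names and the sample 10.0.0.0/8 route are invented for illustration.

```python
import ipaddress

def nh_route_allowed(prefix, source_protocol):
    """Model of RM_NH_FILTER: may this RIB route be used to resolve a BGP next hop?"""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen <= 31:        # deny 10: prefix-list 0.0.0.0/0 le 31
        return False
    if source_protocol == "bgp":   # deny 20: match source-protocol bgp
        return False
    return True                    # permit 30: everything else

def next_hop_state(rib, next_hop):
    """rib: list of (prefix, source_protocol). Returns 'accessible' or 'inaccessible'."""
    nh = ipaddress.ip_address(next_hop)
    covering = [r for r in rib if nh in ipaddress.ip_network(r[0])]
    if not covering:
        return "inaccessible"
    # Assumption: only the most specific covering route is checked.
    best = max(covering, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)
    return "accessible" if nh_route_allowed(*best) else "inaccessible"

# Hypothetical RIB on Peer 3 after Peer 2's /32 has been withdrawn by OSPF.
rib = [
    ("10.3.3.3/32", "ospf"),
    ("10.0.0.0/8", "ospf"),   # made-up summary, not from the lab
    ("0.0.0.0/0", "bgp"),
]

print(next_hop_state(rib, "10.3.3.3"))  # resolves via the OSPF /32
print(next_hop_state(rib, "10.5.5.5"))  # only the /8 covers it, which is denied
```

With the /32 for 10.5.5.5 gone, the only covering routes are shorter than /32, so the model reports the next hop as inaccessible, which matches the behavior described above.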
>> 
>> 
>> I'll just do a quick blog post on this tomorrow.  I labbed it all up
>> already to verify what I talked about above but the post will have to wait
>> until tomorrow.  Wife is telling me it's 11pm and time to get off the
>> computer as it's Friday night ;-)  There are a few minor caveats to this
>> that I'll mention this weekend in the blog post.  Also I pasted my tests
>> below.
>> 
>> --
>> Brian Dennis, CCIEx5 #2210 (R&S/ISP-Dial/Security/SP/Voice)
>> bdennis_at_ine.com
>> 
>> INE, Inc.
>> http://www.INE.com
>> 
>> 
>> 
>> 
>> ******************************************************
>>  R6 is our Peer 3 and R3 and R5 are Peer 1 and 2
>> ******************************************************
>> 
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 3
>> Paths: (2 available, best #2, table Default-IP-Routing-Table)
>>  Advertised to update-groups:
>>        2
>>  200, (Received from a RR-client)
>>    10.5.5.5 (metric 20) from 10.5.5.5 (10.5.5.5)
>>      Origin incomplete, metric 0, localpref 100, valid, internal
>>  200, (Received from a RR-client)
>>    10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>      Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> BGP routing table entry for 0.0.0.0/0, version 2
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>>  Advertised to update-groups:
>>        2
>>  Local, (Received from a RR-client)
>>    10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>>      Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#show ip route 10.3.3.3
>> Routing entry for 10.3.3.3/32
>>  Known via "ospf 1", distance 110, metric 10, type intra area
>>  Last update from 173.1.36.3 on GigabitEthernet0/0.36, 00:17:05 ago
>>  Routing Descriptor Blocks:
>>  * 173.1.36.3, from 10.3.3.3, 00:17:05 ago, via GigabitEthernet0/0.36
>>      Route metric is 10, traffic share count is 1
>> 
>> Rack1R6#show ip route 10.5.5.5
>> Routing entry for 10.5.5.5/32
>>  Known via "ospf 1", distance 110, metric 20, type intra area
>>  Last update from 173.1.0.205 on GigabitEthernet0/0.456, 00:00:50 ago
>>  Routing Descriptor Blocks:
>>  * 173.1.0.205, from 10.5.5.5, 00:00:50 ago, via GigabitEthernet0/0.456
>>      Route metric is 20, traffic share count is 1
>> 
>> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> BGP routing table entry for 0.0.0.0/0, version 2
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>>  Advertised to update-groups:
>>        2
>>  Local, (Received from a RR-client)
>>    10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>>      Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
>> 
>> ******************************************************
>> Now let's shut down the interface R5 uses to reach R6.
>> ******************************************************
>> 
>> Rack1R5#conf t
>> Enter configuration commands, one per line.  End with CNTL/Z.
>> Rack1R5(config)#int fa0/0.456
>> Rack1R5(config-subif)#shut
>> Rack1R5(config-subif)#^Z
>> Rack1R5#
>> %OSPF-5-ADJCHG: Process 1, Nbr 10.6.6.6 on FastEthernet0/0.456 from FULL to DOWN, Neighbor Down: Interface down or detached
>> %OSPF-5-ADJCHG: Process 1, Nbr 10.10.10.10 on FastEthernet0/0.456 from FULL to DOWN, Neighbor Down: Interface down or detached
>> %SYS-5-CONFIG_I: Configured from console by console
>> Rack1R5#
>> 
>> ******************************************************
>>          Now wait for OSPF to converge
>> ******************************************************
>> 
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 4
>> Paths: (2 available, best #1, table Default-IP-Routing-Table)
>> Flag: 0x900
>>  Advertised to update-groups:
>>        2
>>  200, (Received from a RR-client)
>>    10.5.5.5 (metric 2) from 10.5.5.5 (10.5.5.5)
>>      Origin incomplete, metric 0, localpref 100, valid, internal, best
>>  200, (Received from a RR-client)
>>    10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>      Origin incomplete, metric 0, localpref 100, valid, internal
>> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> BGP routing table entry for 0.0.0.0/0, version 2
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>>  Advertised to update-groups:
>>        2
>>  Local, (Received from a RR-client)
>>    10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>>      Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
>> 
>> ******************************************************
>>           Now roughly 180 seconds later
>> ******************************************************
>> 
>> Rack1R6#
>> %BGP-3-NOTIFICATION: received from neighbor 10.5.5.5 4/0 (hold time expired) 0 bytes
>> Rack1R6#
>> %BGP-5-ADJCHANGE: neighbor 10.5.5.5 Down BGP Notification received
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 5
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>> Flag: 0x900
>>  Advertised to update-groups:
>>        2
>>  200, (Received from a RR-client)
>>    10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>      Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
>> 
>> ******************************************************
>> But with the Selective NHT config this is the result
>> ******************************************************
>> 
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 5
>> Paths: (2 available, best #2, table Default-IP-Routing-Table)
>>  Advertised to update-groups:
>>        2
>>  200, (Received from a RR-client)
>>    10.5.5.5 (inaccessible) from 10.5.5.5 (10.5.5.5)
>>      Origin incomplete, metric 0, localpref 100, valid, internal
>>  200, (Received from a RR-client)
>>    10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>      Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
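For anyone repeating this test in a lab, the path lines in the output above are easy to check with a script. The following is a minimal Python sketch that pulls the next hop, its metric (or the "inaccessible" marker) and the best flag out of pasted "show bgp" text. The regular expression is written only against the output format shown in this thread and has not been checked against other IOS versions; the function name is made up, and it does not handle attribute lines that have been wrapped by a mail client.

```python
import re

# Matches e.g. "10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)"
# or          "10.5.5.5 (inaccessible) from 10.5.5.5 (10.5.5.5)"
PATH_LINE = re.compile(
    r"^\s*(\d+\.\d+\.\d+\.\d+) \((?:metric (\d+)|inaccessible)\) from"
)

def parse_paths(output):
    """Return one dict per BGP path found in pasted 'show bgp' output."""
    lines = output.splitlines()
    paths = []
    for i, line in enumerate(lines):
        m = PATH_LINE.match(line)
        if not m:
            continue
        # The attribute line ("Origin ..., valid, internal, best") follows the path line.
        attrs = lines[i + 1] if i + 1 < len(lines) else ""
        flags = [tok.strip() for tok in attrs.split(",")]
        paths.append({
            "next_hop": m.group(1),
            "metric": int(m.group(2)) if m.group(2) else None,  # None = inaccessible
            "best": "best" in flags,
        })
    return paths

SAMPLE = """\
  200, (Received from a RR-client)
    10.5.5.5 (inaccessible) from 10.5.5.5 (10.5.5.5)
      Origin incomplete, metric 0, localpref 100, valid, internal
  200, (Received from a RR-client)
    10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
      Origin incomplete, metric 0, localpref 100, valid, internal, best
"""

paths = parse_paths(SAMPLE)
for p in paths:
    print(p)
```

Run against the Selective NHT output, this reports 10.5.5.5 with no metric and not best, and 10.3.3.3 with metric 10 as the best path.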
>> 
>> 
>> 
>> 
>> 
>> 
>> On 11/30/12 6:21 PM, "John Neiberger" <jneiberger_at_gmail.com> wrote:
>> 
>>> I posted this question to the Cisco NSP list and I've also talked to a
>>> couple of guys from Cisco Advanced Services and I'm still stumped about
>>> something. I'll try my best to phrase it in a way that makes sense.
>>> 
>>> Router A is learning about a prefix from two route reflector clients. In
>>> both cases, the next hop for the prefix is the loopback address of the
>>> advertising routers. Their loopback addresses are being advertised into
>>> OSPF.
>>> 
>>> So, from the perspective of Router A, its BGP table for this prefix has
>>> two paths:
>>> 
>>> 1. 4.4.4.4 (loopback address of Router B, learned via OSPF) * winner due
>>>    to lower IGP metric
>>> 2. 5.5.5.5 (loopback address of Router C, learned via OSPF)
>>> 
>>> Now for the weirdness to begin. A network event occurs that causes the
>>> loopback address of Router C to go away. This shouldn't affect Router A
>>> because it is already selecting the shortest path to the network via
>>> Router B (4.4.4.4).
>>> 
>>> However, Router A is also learning a default via BGP. That means that even
>>> though 5.5.5.5 (loopback of Router C) disappeared and is unreachable, the
>>> router is doing a recursive lookup and keeps the path in the BGP table;
>>> 5.5.5.5 is still reachable, it thinks, by using the default route.
>>> 
>>> The weird thing is that this causes Router A to start using the wrong
>>> path! It seems to be preferring a path with a next hop learned via BGP
>>> to a path with a next hop learned via OSPF. Why would it do this? I see
>>> no documentation that would explain why a BGP-learned next hop is
>>> preferred over an IGP-learned next hop.
>>> 
>>> Is the router still comparing IGP metrics even though the "wrong" path now
>>> has no IGP metric?
>>> 
>>> It's not changing due to router ID, cluster length, or neighbor IP
>>> address. I checked. So, why is it switching?
>>> 
>>> As soon as the BGP session from Router A to Router C times out, the
>>> extraneous path gets removed from the BGP table and the router goes back
>>> to using the correct path it should have been using all along.
>>> 
>>> So, is a BGP-learned next hop preferred over an IGP-learned next hop? If
>>> so, why? If not, any idea why my router switches paths? I've turned on BGP
>>> debugging and IP routing debugging and haven't found a suitable
>>> explanation for the switch.
>>> 
>>> John
>>> 
>>> 
>>> Blogs and organic groups at http://www.ccie.net
>>> 
>>> _______________________________________________________________________
>>> Subscription information may be found at:
>>> http://www.groupstudy.com/list/CCIELab.html
> 
> 
Received on Sat Dec 01 2012 - 08:41:34 ART
This archive was generated by hypermail 2.2.0 : Tue Jan 01 2013 - 09:36:53 ART