Re: BGP Path Selection weirdness regarding next hops from John Neiberger on 2012-12-01 (Ccielab archives 12/2012)

From: John Neiberger <jneiberger_at_gmail.com>
Date: Sat, 1 Dec 2012 08:37:01 -0700

These particular routers are running 4.0.4 but will soon be upgraded to 4.2.

On Sat, Dec 1, 2012 at 8:36 AM, Brian McGahan <bmcgahan_at_ine.com> wrote:

> What XR version are you running?
>
> Brian McGahan, CCIE #8593 (R&S/SP/Security)
> bmcgahan_at_INE.com
>
>
> On Dec 1, 2012, at 9:18 AM, "John Neiberger" <jneiberger_at_gmail.com> wrote:
>
> > Yes, that is exactly what happened! I had never run into that behavior.
> > The root cause of this weirdness for us was a linecard that had locked up.
> > It was no longer passing traffic, but it had not yet reloaded, so the
> > upstream neighbor timed out OSPF in 4 seconds but BGP stayed up for 180
> > seconds. During that time period, traffic for some destinations was
> > blackholed because of the presence of a supernet route that was being used
> > to validate the reachability of the next hop in BGP.
> >
> > I had never heard of Next Hop Tracking until yesterday when someone from
> > Cisco said that a similar problem was solved in IOS XR using this feature,
> > but I don't recall if they mentioned that this feature was available in
> > IOS. Our production routers run XR and my home lab is IOS. I'll lab it up
> > using NHT and use that as a proof-of-concept to the other engineers
> > involved in this.
> >
> > Thanks go to you and to Marko for helping me through this. We were pretty
> > confused by it, partially because none of us observed the problem while it
> > was happening. By the time network engineering was involved, the problem
> > had resolved. We just had to figure it out step-by-step in reverse, which
> > took a while.
> >
> > Thanks again to both of you!
> > John
> >
> >
> > On Sat, Dec 1, 2012 at 12:13 AM, Brian Dennis <bdennis_at_ine.com> wrote:
> >
> >> John,
> >> Let me see if I can sum this up:
> >>
> >> Two iBGP peers (let's call them Peer 1 and Peer 2) are advertising the
> >> same prefix via iBGP to Peer 3. From Peer 1 the NH (Next-Hop) metric is
> >> 10 to reach the prefix and from Peer 2 the NH metric is 20. The BGP
> >> decision process is selecting Peer 1 over Peer 2 due to the lower NH
> >> metric. You also have a default route learned via iBGP from another peer
> >> and let's say it has a NH metric of 5.
> >>
> >> Now Peer 2 goes down and within about 35 to 45 seconds the IGP converges
> >> and the NH is removed from Peer 3's RIB. Okay fine, so what, you might
> >> think as you're not using Peer 2 to reach that prefix anyways. The BGP
> >> peering session is still up due to the default BGP keepalive/hold timers
> >> being 60/180 seconds in IOS. So time wise we are about 60 seconds after
> >> the failure of Peer 2.
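> >>
> >> As a minimal sketch of where those timers live, assuming AS 10 and the
> >> 10.5.5.5 neighbor from the lab; the 10/30 values are placeholders for
> >> illustration, not a recommendation. The session ends up using the lower
> >> of the hold times the two peers advertise:
> >>
> >> ```
> >> router bgp 10
> >>  ! keepalive 10 seconds, hold time 30 seconds (illustrative values)
> >>  neighbor 10.5.5.5 timers 10 30
> >> !
> >> ```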
> >>
> >> After the default delay timer of 5 seconds the NHT (Next-Hop-Tracking) in
> >> BGP on Peer 3 kicks in (or the BGP scan process in older IOS versions) to
> >> look and see if the NH to Peer 2 is still reachable via another route in
> >> the RIB. The only route available to reach the NH advertised by Peer 2 is
> >> the default route. The default route's NH metric is less than the NH
> >> metric to Peer 1 which is the current best path. This means that BGP
> >> updates the NH metric to Peer 2's NH with the IGP metric to reach the
> >> default route (5). Now the prefix advertised by Peer 2 becomes the best
> >> path since it now has a lower NH metric. At this point we are about 66
> >> seconds after the failure of Peer 2 but we still have 114 more seconds
> >> until BGP detects that Peer 2 is actually down so Peer 2's prefix is
> >> still useable by BGP. The real problem, as we know, is that Peer 2 is
> >> actually down but Peer 3 will not detect it until the BGP hold time
> >> finally expires (180 seconds). Once Peer 2 is declared down, Peer 3 will
> >> switch back to Peer 1's prefix. The problem comes in that traffic is at
> >> best sub-optimally routed or at worst "black holed" when this occurs as
> >> Peer 3 isn't using the "best" path (Peer 1).
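> >>
> >> For reference, a sketch of where that NHT delay is configured, assuming
> >> AS 10. Since 5 seconds is the IOS default, this line only makes the
> >> default explicit:
> >>
> >> ```
> >> router bgp 10
> >>  ! seconds BGP waits after a RIB change before re-evaluating next hops
> >>  bgp nexthop trigger delay 5
> >> !
> >> ```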
> >>
> >> This is actually a common problem with a simple solution. Using BGP
> >> Selective NHT, just don't allow BGP to resolve the next hop through any
> >> route with a /31 or shorter mask (or whatever length you want) and/or
> >> through another BGP route.
> >>
> >>
> >> router bgp 10
> >> bgp nexthop route-map RM_NH_FILTER
> >> !
> >>
> >> ip prefix-list PL_NH_FILTER seq 5 permit 0.0.0.0/0 le 31
> >> !
> >> route-map RM_NH_FILTER deny 10
> >> match ip address prefix-list PL_NH_FILTER
> >> !
> >> route-map RM_NH_FILTER deny 20
> >> match source-protocol bgp 10
> >> !
> >> route-map RM_NH_FILTER permit 30
> >>
> >>
> >> With this configuration Peer 3 will show the NH to Peer 2's prefix as
> >> "inaccessible" as it cannot use any route with a mask shorter than a /32
> >> and/or a route that's installed in the RIB via BGP. This can also be used
> >> to not allow BGP to use a discard route for the NH which is another
> >> common problem.
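> >>
> >> The same idea could also be written as an allow-list. This is an
> >> untested sketch: the names PL_NH_HOSTS and RM_NH_ALLOW are made up for
> >> illustration, and it assumes every BGP next hop is a /32 loopback
> >> learned via OSPF process 1, as in the tests below. Anything that does
> >> not match falls through to the route-map's implicit deny:
> >>
> >> ```
> >> router bgp 10
> >>  bgp nexthop route-map RM_NH_ALLOW
> >> !
> >> ! host routes only
> >> ip prefix-list PL_NH_HOSTS seq 5 permit 0.0.0.0/0 ge 32
> >> !
> >> ! both matches must pass: a /32 that was learned from OSPF process 1
> >> route-map RM_NH_ALLOW permit 10
> >>  match ip address prefix-list PL_NH_HOSTS
> >>  match source-protocol ospf 1
> >> !
> >> ```
> >>
> >> The trade-off is that a next hop reached through a connected or static
> >> route would be rejected too, so the deny-style filter above is the more
> >> forgiving of the two.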
> >>
> >>
> >> I'll just do a quick blog post on this tomorrow. I labbed it all up
> >> already to verify what I talked about above but the post will have to
> >> wait until tomorrow. Wife is telling me it's 11pm and time to get off the
> >> computer as it's Friday night ;-) There are a few minor caveats to this
> >> that I'll mention this weekend in the blog post. Also I pasted my tests
> >> below.
> >>
> >> --
> >> Brian Dennis, CCIEx5 #2210 (R&S/ISP-Dial/Security/SP/Voice)
> >> bdennis_at_ine.com
> >>
> >> INE, Inc.
> >> http://www.INE.com
> >>
> >>
> >>
> >>
> >> ******************************************************
> >> R6 is our Peer 3 and R3 and R5 are Peer 1 and 2
> >> ******************************************************
> >>
> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
> >> BGP routing table entry for 50.0.0.0/8, version 3
> >> Paths: (2 available, best #2, table Default-IP-Routing-Table)
> >> Advertised to update-groups:
> >> 2
> >> 200, (Received from a RR-client)
> >> 10.5.5.5 (metric 20) from 10.5.5.5 (10.5.5.5)
> >> Origin incomplete, metric 0, localpref 100, valid, internal
> >> 200, (Received from a RR-client)
> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
> >> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
> >> BGP routing table entry for 0.0.0.0/0, version 2
> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
> >> Advertised to update-groups:
> >> 2
> >> Local, (Received from a RR-client)
> >> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
> >> Rack1R6#show ip route 10.3.3.3
> >> Routing entry for 10.3.3.3/32
> >> Known via "ospf 1", distance 110, metric 10, type intra area
> >> Last update from 173.1.36.3 on GigabitEthernet0/0.36, 00:17:05 ago
> >> Routing Descriptor Blocks:
> >> * 173.1.36.3, from 10.3.3.3, 00:17:05 ago, via GigabitEthernet0/0.36
> >> Route metric is 10, traffic share count is 1
> >>
> >> Rack1R6#show ip route 10.5.5.5
> >> Routing entry for 10.5.5.5/32
> >> Known via "ospf 1", distance 110, metric 20, type intra area
> >> Last update from 173.1.0.205 on GigabitEthernet0/0.456, 00:00:50 ago
> >> Routing Descriptor Blocks:
> >> * 173.1.0.205, from 10.5.5.5, 00:00:50 ago, via GigabitEthernet0/0.456
> >> Route metric is 20, traffic share count is 1
> >>
> >> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
> >> BGP routing table entry for 0.0.0.0/0, version 2
> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
> >> Advertised to update-groups:
> >> 2
> >> Local, (Received from a RR-client)
> >> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
> >> Rack1R6#
> >>
> >> ******************************************************
> >> Now let's shut down the interface R5 uses to reach R6.
> >> ******************************************************
> >>
> >> Rack1R5#conf t
> >> Enter configuration commands, one per line. End with CNTL/Z.
> >> Rack1R5(config)#int fa0/0.456
> >> Rack1R5(config-subif)#shut
> >> Rack1R5(config-subif)#^Z
> >> Rack1R5#
> >> %OSPF-5-ADJCHG: Process 1, Nbr 10.6.6.6 on FastEthernet0/0.456 from FULL
> >> to DOWN, Neighbor Down: Interface down or detached
> >> %OSPF-5-ADJCHG: Process 1, Nbr 10.10.10.10 on FastEthernet0/0.456 from
> >> FULL to DOWN, Neighbor Down: Interface down or detached
> >> %SYS-5-CONFIG_I: Configured from console by console
> >> Rack1R5#
> >>
> >> ******************************************************
> >> Now wait for OSPF to converge
> >> ******************************************************
> >>
> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
> >> BGP routing table entry for 50.0.0.0/8, version 4
> >> Paths: (2 available, best #1, table Default-IP-Routing-Table)
> >> Flag: 0x900
> >> Advertised to update-groups:
> >> 2
> >> 200, (Received from a RR-client)
> >> 10.5.5.5 (metric 2) from 10.5.5.5 (10.5.5.5)
> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
> >> 200, (Received from a RR-client)
> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
> >> Origin incomplete, metric 0, localpref 100, valid, internal
> >> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
> >> BGP routing table entry for 0.0.0.0/0, version 2
> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
> >> Advertised to update-groups:
> >> 2
> >> Local, (Received from a RR-client)
> >> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
> >> Rack1R6#
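> >>
> >> At this point the route that the next hop resolves through can be
> >> checked with a longest-match lookup in the FIB. Command only, no output
> >> was captured for it here; with the 10.5.5.5/32 OSPF route gone, the
> >> match would be expected to be the default route:
> >>
> >> ```
> >> Rack1R6#show ip cef 10.5.5.5
> >> ```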
> >>
> >> ******************************************************
> >> Now roughly 180 seconds later
> >> ******************************************************
> >>
> >> Rack1R6#
> >> %BGP-3-NOTIFICATION: received from neighbor 10.5.5.5 4/0 (hold time
> >> expired) 0 bytes
> >> Rack1R6#
> >> %BGP-5-ADJCHANGE: neighbor 10.5.5.5 Down BGP Notification received
> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
> >> BGP routing table entry for 50.0.0.0/8, version 5
> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
> >> Flag: 0x900
> >> Advertised to update-groups:
> >> 2
> >> 200, (Received from a RR-client)
> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
> >> Rack1R6#
> >>
> >> ******************************************************
> >> But with the Selective NHT config this is the result
> >> ******************************************************
> >>
> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
> >> BGP routing table entry for 50.0.0.0/8, version 5
> >> Paths: (2 available, best #2, table Default-IP-Routing-Table)
> >> Advertised to update-groups:
> >> 2
> >> 200, (Received from a RR-client)
> >> 10.5.5.5 (inaccessible) from 10.5.5.5 (10.5.5.5)
> >> Origin incomplete, metric 0, localpref 100, valid, internal
> >> 200, (Received from a RR-client)
> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
> >> Rack1R6#
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 11/30/12 6:21 PM, "John Neiberger" <jneiberger_at_gmail.com> wrote:
> >>
> >>> I posted this question to the Cisco NSP list and I've also talked to a
> >>> couple of guys from Cisco Advanced Services and I'm still stumped about
> >>> something. I'll try my best to phrase it in a way that makes sense.
> >>>
> >>> Router A is learning about a prefix from two route reflector clients. In
> >>> both cases, the next hop for the prefix is the loopback address of the
> >>> advertising router. Their loopback addresses are being advertised into
> >>> OSPF.
> >>>
> >>> So, from the perspective of Router A, its BGP table for this prefix has
> >>> two paths:
> >>>
> >>> 1. 4.4.4.4 (loopback address of Router B, learned via OSPF) * winner due
> >>>    to lower IGP metric
> >>> 2. 5.5.5.5 (loopback address of Router C, learned via OSPF)
> >>>
> >>> Now for the weirdness to begin. A network event occurs that causes the
> >>> loopback address of Router C to go away. This shouldn't affect Router A
> >>> because it is already selecting the shortest path to the network via
> >>> Router B (4.4.4.4).
> >>>
> >>> However, Router A is also learning a default via BGP. That means that
> >>> even though 5.5.5.5 (loopback of Router C) disappeared and is
> >>> unreachable, the router is doing a recursive lookup and keeps the path
> >>> in the BGP table; 5.5.5.5 is still reachable, it thinks, by using the
> >>> default route.
> >>>
> >>> The weird thing is that this causes Router A to start using the wrong
> >>> path! It seems to be preferring a path with a next hop learned via BGP
> >>> to a path with a next hop learned via OSPF. Why would it do this? I see
> >>> no documentation that would explain why a BGP-learned next hop is
> >>> preferred over an IGP-learned next hop.
> >>>
> >>> Is the router still comparing IGP metrics even though the "wrong" path
> >>> now has no IGP metric?
> >>>
> >>> It's not changing due to router ID, cluster length, or neighbor IP
> >>> address. I checked. So, why is it switching?
> >>>
> >>> As soon as the BGP session from Router A to Router C times out, the
> >>> extraneous path gets removed from the BGP table and the router goes back
> >>> to using the correct path it should have been using all along.
> >>>
> >>> So, is a BGP-learned next hop preferred over an IGP-learned next hop? If
> >>> so, why? If not, any idea why my router switches paths? I've turned on
> >>> BGP debugging and IP routing debugging and haven't found a suitable
> >>> explanation for the switch.
> >>>
> >>> John
> >>>
> >>>
> >>> Blogs and organic groups at http://www.ccie.net
> >>>
> >>> _______________________________________________________________________
> >>> Subscription information may be found at:
> >>> http://www.groupstudy.com/list/CCIELab.html
> >
> >

Received on Sat Dec 01 2012 - 08:37:01 ART

This archive was generated by hypermail 2.2.0 : Tue Jan 01 2013 - 09:36:53 ART