Re: BGP Path Selection weirdness regarding next hops

From: Joe Sanchez <marco207p_at_gmail.com>
Date: Sat, 01 Dec 2012 13:49:18 -0600

The other aspect in this could be OSPF. If Router A is an ABR, the route
to 5.5.5.5 could be re-learned from a router in Area 0, which could add
more black hole time to the convergence.
John, you did mention that OSPF was in play here, right? Can you provide
anything on the OSPF setup in this scenario?

Regards,
JS

On 12/1/12 11:44 AM, "John Neiberger" <jneiberger_at_gmail.com> wrote:

>Sorry, I got sidetracked after we looked at it. Lol. I agree with you.
>There are a few design issues that conspired to cause this. Next hop
>tracking seems to be a really good solution, but I think a change of design
>might do the trick, as well. I'm going to talk to our network architects
>and mention both options.
>
>I really appreciate everyone's help! You guys are awesome.
>
>John
>On Dec 1, 2012 9:41 AM, "Marko Milivojevic" <markom_at_ipexpert.com> wrote:
>
>> I was hoping you'd post the findings of our webex session last night -
>> didn't want to steal the thunder :-)
>>
>> Personally, I think we've uncovered a great many little things that
>> conspired to cause this :-)
>>
>> --
>> Marko Milivojevic - CCIE #18427 (SP R&S)
>> Senior CCIE Instructor - IPexpert
>>
>> :: This message was sent from a mobile device. I apologize for errors
>> and brevity. ::
>>
>> On Dec 1, 2012, at 7:17, John Neiberger <jneiberger_at_gmail.com> wrote:
>>
>> > Yes, that is exactly what happened! I had never run into that behavior.
>> > The root cause of this weirdness for us was a linecard that had locked up.
>> > It was no longer passing traffic, but it had not yet reloaded, so the
>> > upstream neighbor timed out OSPF in 4 seconds but BGP stayed up for 180
>> > seconds. During that time period, traffic for some destinations was
>> > blackholed because of the presence of a supernet route that was being used
>> > to validate the reachability of the next hop in BGP.
>> >
>> > I had never heard of Next Hop Tracking until yesterday when someone from
>> > Cisco said that a similar problem was solved in IOS XR using this feature,
>> > but I don't recall if they mentioned that this feature was available in
>> > IOS. Our production routers run XR and my home lab is IOS. I'll lab it up
>> > using NHT and use that as a proof-of-concept to the other engineers
>> > involved in this.
>> >
>> > Thanks go to you and to Marko for helping me through this. We were pretty
>> > confused by it, partially because none of us observed the problem while it
>> > was happening. By the time network engineering was involved, the problem
>> > had resolved. We just had to figure it out step-by-step in reverse, which
>> > took a while.
>> >
>> > Thanks again to both of you!
>> > John
>> >
>> >
>> > On Sat, Dec 1, 2012 at 12:13 AM, Brian Dennis <bdennis_at_ine.com> wrote:
>> >
>> >> John,
>> >> Let me see if I can sum this up:
>> >>
>> >> Two iBGP peers (let's call them Peer 1 and Peer 2) are advertising the
>> >> same prefix via iBGP to Peer 3. From Peer 1 the NH (Next-Hop) metric is
>> >> 10 to reach the prefix and from Peer 2 the NH metric is 20. The BGP
>> >> decision process is selecting Peer 1 over Peer 2 due to the lower NH
>> >> metric. You also have a default route learned via iBGP from another peer
>> >> and let's say it has a NH metric of 5.
>> >>
>> >> Now Peer 2 goes down and within about 35 to 45 seconds the IGP converges
>> >> and the NH is removed from Peer 3's RIB. Okay fine, so what, you might
>> >> think, as you're not using Peer 2 to reach that prefix anyway. The BGP
>> >> peering session is still up due to the default BGP keepalive/hold timers
>> >> being 60/180 seconds in IOS. So time-wise we are about 60 seconds after
>> >> the failure of Peer 2.
>> >>
>> >> After the default delay timer of 5 seconds the NHT (Next-Hop-Tracking) in
>> >> BGP on Peer 3 kicks in (or the BGP scan process in older IOS versions) to
>> >> look and see if the NH to Peer 2 is still reachable via another route in
>> >> the RIB. The only route available to reach the NH advertised by Peer 2 is
>> >> the default route. The default route's NH metric is less than the NH
>> >> metric to Peer 1, which is the current best path. This means that BGP
>> >> updates the NH metric to Peer 2's NH with the IGP metric to reach the
>> >> default route (5). Now the prefix advertised by Peer 2 becomes the best
>> >> path since it now has a lower NH metric. At this point we are about 66
>> >> seconds after the failure of Peer 2 but we still have 114 more seconds
>> >> until BGP detects that Peer 2 is actually down, so Peer 2's prefix is
>> >> still usable by BGP. The real problem, as we know, is that Peer 2 is
>> >> actually down but Peer 3 will not detect it until the BGP hold time
>> >> finally expires (180 seconds). Once Peer 2 is declared down, Peer 3 will
>> >> switch back to Peer 1's prefix. The problem comes in that traffic is at
>> >> best sub-optimally routed or at worst "black holed" when this occurs, as
>> >> Peer 3 isn't using the "best" path (Peer 1).
>> >>
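For anyone who wants to see the flip without a lab, the sequence described above can be sketched in a few lines of Python. This is only a model under stated assumptions, not how IOS implements it: the RIB is a hypothetical list of (prefix, source, metric) tuples using the addresses and metrics from the lab output in this thread, the default route simply carries the metric toward its own next hop (2), and every earlier BGP tie-breaker is assumed equal so that only the IGP metric to the next hop decides.

```python
import ipaddress

# Hypothetical RIB entries: (prefix, source protocol, metric).
# Values mirror the lab output in this thread; the structure is an assumption.
RIB_BEFORE = [
    ("10.3.3.3/32", "ospf", 10),  # loopback of the preferred peer
    ("10.5.5.5/32", "ospf", 20),  # loopback of the peer that later fails
    ("0.0.0.0/0", "bgp", 2),      # BGP-learned default route
]
# After the IGP converges, the /32 for the failed peer is gone.
RIB_AFTER = [r for r in RIB_BEFORE if r[0] != "10.5.5.5/32"]

# Next hops of the two BGP paths for the same prefix.
PATHS = ["10.5.5.5", "10.3.3.3"]


def resolve_next_hop(rib, next_hop):
    """Return the longest-match RIB entry covering next_hop, or None."""
    addr = ipaddress.ip_address(next_hop)
    matches = [r for r in rib if addr in ipaddress.ip_network(r[0])]
    if not matches:
        return None
    return max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)


def best_path(rib, next_hops):
    """Pick the next hop that resolves with the lowest metric.

    Assumes weight, local-pref, AS path and so on are all equal.
    """
    usable = []
    for nh in next_hops:
        route = resolve_next_hop(rib, nh)
        if route is not None:
            usable.append((route[2], nh))
    return min(usable)[1] if usable else None


print(best_path(RIB_BEFORE, PATHS))  # 10.3.3.3
print(best_path(RIB_AFTER, PATHS))   # 10.5.5.5
```

Once the /32 for 10.5.5.5 is gone, the longest match for that next hop is the default route, and its metric of 2 beats the metric of 10 toward 10.3.3.3. That is the same flip the lab output shows.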
>> >> This is actually a common problem with a simple solution. Just don't
>> >> allow BGP to use any /31 or shorter route (or whatever length you want)
>> >> and/or another BGP route to reach the next hop, using BGP Selective NHT.
>> >>
>> >>
>> >> router bgp 10
>> >>  bgp nexthop route-map RM_NH_FILTER
>> >> !
>> >>
>> >> ip prefix-list PL_NH_FILTER seq 5 permit 0.0.0.0/0 le 31
>> >> !
>> >> route-map RM_NH_FILTER deny 10
>> >>  match ip address prefix-list PL_NH_FILTER
>> >> !
>> >> route-map RM_NH_FILTER deny 20
>> >>  match source-protocol bgp 10
>> >> !
>> >> route-map RM_NH_FILTER permit 30
>> >>
>> >>
>> >> With this configuration Peer 3 will show the NH to Peer 2's prefix as
>> >> "inaccessible" as it cannot use any route with a mask shorter than a /32
>> >> and/or a route that's installed in the RIB via BGP. This can also be used
>> >> to not allow BGP to use a discard route for the NH, which is another
>> >> common problem.
>> >>
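The filter logic in the route-map above can be modeled in a few lines of Python. This is a sketch under assumptions, not IOS internals: the function names are made up for illustration, the RIB is a hypothetical list of (prefix, source, metric) tuples, and I am assuming the route-map is evaluated against the longest-match route that covers the next hop.

```python
import ipaddress


def nh_filter_permits(prefix, source):
    """Model of RM_NH_FILTER: may this RIB route resolve a BGP next hop?"""
    # deny 10: prefix-list PL_NH_FILTER matches any mask length of 31 or less
    if ipaddress.ip_network(prefix).prefixlen <= 31:
        return False
    # deny 20: routes installed in the RIB by BGP
    if source == "bgp":
        return False
    # permit 30: everything else, in practice non-BGP /32 routes
    return True


def next_hop_status(rib, next_hop):
    """Return 'metric N' if the next hop resolves through a permitted route,
    otherwise 'inaccessible'."""
    addr = ipaddress.ip_address(next_hop)
    matches = [r for r in rib if addr in ipaddress.ip_network(r[0])]
    if not matches:
        return "inaccessible"
    prefix, source, metric = max(
        matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen
    )
    if not nh_filter_permits(prefix, source):
        return "inaccessible"
    return f"metric {metric}"


# Hypothetical RIB after the failed peer's /32 has been withdrawn by the IGP.
RIB_AFTER_FAILURE = [
    ("10.3.3.3/32", "ospf", 10),
    ("0.0.0.0/0", "bgp", 2),
]

print(next_hop_status(RIB_AFTER_FAILURE, "10.5.5.5"))  # inaccessible
print(next_hop_status(RIB_AFTER_FAILURE, "10.3.3.3"))  # metric 10
```

With the filter in place the default route can no longer stand in for the missing /32, so the path through 10.5.5.5 drops out of best-path selection right away instead of waiting for the hold timer.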
>> >>
>> >> I'll just do a quick blog post on this tomorrow. I labbed it all up
>> >> already to verify what I talked about above but the post will have to
>> >> wait until tomorrow. Wife is telling me it's 11pm and time to get off the
>> >> computer as it's Friday night ;-) There are a few minor caveats to this
>> >> that I'll mention this weekend in the blog post. Also I pasted my tests
>> >> below.
>> >>
>> >> --
>> >> Brian Dennis, CCIEx5 #2210 (R&S/ISP-Dial/Security/SP/Voice)
>> >> bdennis_at_ine.com
>> >>
>> >> INE, Inc.
>> >> http://www.INE.com
>> >>
>> >>
>> >>
>> >>
>> >> ******************************************************
>> >> R6 is our Peer 3 and R3 and R5 are Peer 1 and 2
>> >> ******************************************************
>> >>
>> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> >> BGP routing table entry for 50.0.0.0/8, version 3
>> >> Paths: (2 available, best #2, table Default-IP-Routing-Table)
>> >> Advertised to update-groups:
>> >> 2
>> >> 200, (Received from a RR-client)
>> >> 10.5.5.5 (metric 20) from 10.5.5.5 (10.5.5.5)
>> >> Origin incomplete, metric 0, localpref 100, valid, internal
>> >> 200, (Received from a RR-client)
>> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> >> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> >> BGP routing table entry for 0.0.0.0/0, version 2
>> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>> >> Advertised to update-groups:
>> >> 2
>> >> Local, (Received from a RR-client)
>> >> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> >> Rack1R6#show ip route 10.3.3.3
>> >> Routing entry for 10.3.3.3/32
>> >> Known via "ospf 1", distance 110, metric 10, type intra area
>> >> Last update from 173.1.36.3 on GigabitEthernet0/0.36, 00:17:05 ago
>> >> Routing Descriptor Blocks:
>> >> * 173.1.36.3, from 10.3.3.3, 00:17:05 ago, via GigabitEthernet0/0.36
>> >> Route metric is 10, traffic share count is 1
>> >>
>> >> Rack1R6#show ip route 10.5.5.5
>> >> Routing entry for 10.5.5.5/32
>> >> Known via "ospf 1", distance 110, metric 20, type intra area
>> >> Last update from 173.1.0.205 on GigabitEthernet0/0.456, 00:00:50 ago
>> >> Routing Descriptor Blocks:
>> >> * 173.1.0.205, from 10.5.5.5, 00:00:50 ago, via GigabitEthernet0/0.456
>> >> Route metric is 20, traffic share count is 1
>> >>
>> >> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> >> BGP routing table entry for 0.0.0.0/0, version 2
>> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>> >> Advertised to update-groups:
>> >> 2
>> >> Local, (Received from a RR-client)
>> >> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> >> Rack1R6#
>> >>
>> >> ******************************************************
>> >> Now let's shut down the interface R5 uses to reach R6.
>> >> ******************************************************
>> >>
>> >> Rack1R5#conf t
>> >> Enter configuration commands, one per line. End with CNTL/Z.
>> >> Rack1R5(config)#int fa0/0.456
>> >> Rack1R5(config-subif)#shut
>> >> Rack1R5(config-subif)#^Z
>> >> Rack1R5#
>> >> %OSPF-5-ADJCHG: Process 1, Nbr 10.6.6.6 on FastEthernet0/0.456 from FULL to DOWN, Neighbor Down: Interface down or detached
>> >> %OSPF-5-ADJCHG: Process 1, Nbr 10.10.10.10 on FastEthernet0/0.456 from FULL to DOWN, Neighbor Down: Interface down or detached
>> >> %SYS-5-CONFIG_I: Configured from console by console
>> >> Rack1R5#
>> >>
>> >> ******************************************************
>> >> Now wait for OSPF to converge
>> >> ******************************************************
>> >>
>> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> >> BGP routing table entry for 50.0.0.0/8, version 4
>> >> Paths: (2 available, best #1, table Default-IP-Routing-Table)
>> >> Flag: 0x900
>> >> Advertised to update-groups:
>> >> 2
>> >> 200, (Received from a RR-client)
>> >> 10.5.5.5 (metric 2) from 10.5.5.5 (10.5.5.5)
>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> >> 200, (Received from a RR-client)
>> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>> >> Origin incomplete, metric 0, localpref 100, valid, internal
>> >> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> >> BGP routing table entry for 0.0.0.0/0, version 2
>> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>> >> Advertised to update-groups:
>> >> 2
>> >> Local, (Received from a RR-client)
>> >> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> >> Rack1R6#
>> >>
>> >> ******************************************************
>> >> Now roughly 180 seconds later
>> >> ******************************************************
>> >>
>> >> Rack1R6#
>> >> %BGP-3-NOTIFICATION: received from neighbor 10.5.5.5 4/0 (hold time expired) 0 bytes
>> >> Rack1R6#
>> >> %BGP-5-ADJCHANGE: neighbor 10.5.5.5 Down BGP Notification received
>> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> >> BGP routing table entry for 50.0.0.0/8, version 5
>> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>> >> Flag: 0x900
>> >> Advertised to update-groups:
>> >> 2
>> >> 200, (Received from a RR-client)
>> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> >> Rack1R6#
>> >>
>> >> ******************************************************
>> >> But with the Selective NHT config this is the result
>> >> ******************************************************
>> >>
>> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> >> BGP routing table entry for 50.0.0.0/8, version 5
>> >> Paths: (2 available, best #2, table Default-IP-Routing-Table)
>> >> Advertised to update-groups:
>> >> 2
>> >> 200, (Received from a RR-client)
>> >> 10.5.5.5 (inaccessible) from 10.5.5.5 (10.5.5.5)
>> >> Origin incomplete, metric 0, localpref 100, valid, internal
>> >> 200, (Received from a RR-client)
>> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>> >> Rack1R6#
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 11/30/12 6:21 PM, "John Neiberger" <jneiberger_at_gmail.com> wrote:
>> >>
>> >>> I posted this question to the Cisco NSP list and I've also talked to a
>> >>> couple of guys from Cisco Advanced Services and I'm still stumped about
>> >>> something. I'll try my best to phrase it in a way that makes sense.
>> >>>
>> >>> Router A is learning about a prefix from two route reflector clients. In
>> >>> both cases, the next hop for the prefix is the loopback address of the
>> >>> advertising routers. Their loopback addresses are being advertised into
>> >>> OSPF.
>> >>>
>> >>> So, from the perspective of Router A, its BGP table for this prefix has
>> >>> two paths:
>> >>>
>> >>> 1. 4.4.4.4 (loopback address of Router B, learned via OSPF) * winner due
>> >>>    to lower IGP metric
>> >>> 2. 5.5.5.5 (loopback address of Router C, learned via OSPF)
>> >>>
>> >>> Now for the weirdness to begin. A network event occurs that causes the
>> >>> loopback address of Router C to go away. This shouldn't affect Router A
>> >>> because it is already selecting the shortest path to the network via
>> >>> Router B (4.4.4.4).
>> >>>
>> >>> However, Router A is also learning a default via BGP. That means that
>> >>> even though 5.5.5.5 (loopback of Router C) disappeared and is
>> >>> unreachable, the router is doing a recursive lookup and keeps the path in
>> >>> the BGP table; 5.5.5.5 is still reachable, it thinks, by using the
>> >>> default route.
>> >>>
>> >>> The weird thing is that this causes Router A to start using the wrong
>> >>> path! It seems to be preferring a path with a next hop learned via BGP to
>> >>> a path with a next hop learned via OSPF. Why would it do this? I see no
>> >>> documentation that would explain why a BGP-learned next hop is preferred
>> >>> over an IGP-learned next hop.
>> >>>
>> >>> Is the router still comparing IGP metrics even though the "wrong" path
>> >>> now has no IGP metric?
>> >>>
>> >>> It's not changing due to router ID, cluster length, or neighbor IP
>> >>> address. I checked. So, why is it switching?
>> >>>
>> >>> As soon as the BGP session from Router A to Router C times out, the
>> >>> extraneous path gets removed from the BGP table and the router goes back
>> >>> to using the correct path it should have been using all along.
>> >>>
>> >>> So, is a BGP-learned next hop preferred over an IGP-learned next hop? If
>> >>> so, why? If not, any idea why my router switches paths? I've turned on
>> >>> BGP debugging and IP routing debugging and haven't found a suitable
>> >>> explanation for the switch.
>> >>>
>> >>> John
>> >>>
>> >>>
>> >>> Blogs and organic groups at http://www.ccie.net
>> >>>
>> >>>
>>_______________________________________________________________________
>> >>> Subscription information may be found at:
>> >>> http://www.groupstudy.com/list/CCIELab.html

Received on Sat Dec 01 2012 - 13:49:18 ART

This archive was generated by hypermail 2.2.0 : Tue Jan 01 2013 - 09:36:53 ART