Re: BGP Path Selection weirdness regarding next hops from Joe Sanchez on 2012-12-01 (Ccielab archives 12/2012)

From: Joe Sanchez <marco207p_at_gmail.com>
Date: Sat, 01 Dec 2012 13:57:29 -0600

Marko,

Thanks for the info.
I just thought I'd throw a little extra into the mix,
so when B.D. does his blog on it, he might add it to the cocktail. :)

On 12/1/12 1:53 PM, "Marko Milivojevic" <markom_at_ipexpert.com> wrote:

>The main cause in this particular case was a recursive next-hop lookup
>for 0.0.0.0/0, which was the directly connected route on one of the
>RRs. If it was an OSPF-learned route, no acrobatics with the next-hop
>tracking would be needed.
>
>Another thing was the quasi-redundant setup that wasn't really
>redundant and it was a major contributor to the temporary outage
>(timing of it well explained by B.D.). Last but not the least was the
>nature of the failure, the solution for which is to really use BFD or
>other non-direct link failure detection (like for example LACP).
>
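To put rough numbers on why faster failure detection matters here, a minimal Python sketch follows. The BFD interval and multiplier are illustrative assumptions, not values from this thread; the 4-second OSPF timeout and the 180-second BGP hold time are the figures John reports further down.

```python
# Rough comparison of failure-detection times, in seconds.
# Assumption: BFD detection time is roughly the negotiated interval
# multiplied by the detect multiplier.

def bfd_detection_time(interval_ms, multiplier):
    """Approximate BFD detection time in seconds."""
    return interval_ms * multiplier / 1000.0

BGP_HOLD_TIME = 180.0    # BGP hold time quoted in the thread
OSPF_DEAD_TIME = 4.0     # OSPF timeout John observed

# Hypothetical BFD timers, chosen only for illustration.
bfd = bfd_detection_time(interval_ms=300, multiplier=3)

# Time during which OSPF has already dropped the neighbor but BGP
# still considers the session (and its paths) alive.
exposure = BGP_HOLD_TIME - OSPF_DEAD_TIME

print(f"BFD detection:            {bfd:.1f} s")
print(f"BGP-only exposure window: {exposure:.0f} s")
```

With these assumed timers, a failure would be noticed in under a second instead of leaving BGP paths in place for close to three minutes.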
>It was really fun troubleshooting this last night and both John and I
>really wished we were doing it on XR or Junos, as IOS has really
>limited visibility into what was going on.
>
>--
>Marko Milivojevic - CCIE #18427 (SP R&S)
>Senior CCIE Instructor - IPexpert
>
>On Sat, Dec 1, 2012 at 11:49 AM, Joe Sanchez <marco207p_at_gmail.com> wrote:
>> The other aspect in this could deal with the OSPF. If router-A is an ABR,
>> the route to 5.5.5.5 could be re-learned from router (Area0), which could
>> also add additional black hole time to the convergence.
>> John, you did mention that OSPF was in play here, right? Can you provide
>> anything on the OSPF setup of this scenario?
>>
>> Regards,
>> JS
>>
>> On 12/1/12 11:44 AM, "John Neiberger" <jneiberger_at_gmail.com> wrote:
>>
>>>Sorry, I got sidetracked after we looked at it. Lol. I agree with you.
>>>There are a few design issues that conspired to cause this. Next hop
>>>tracking seems to be a really good solution, but I think a change of
>>>design might do the trick, as well. I'm going to talk to our network
>>>architects and mention both options.
>>>
>>>I really appreciate everyone's help! You guys are awesome.
>>>
>>>John
>>>On Dec 1, 2012 9:41 AM, "Marko Milivojevic" <markom_at_ipexpert.com> wrote:
>>>
>>>> I was hoping you'd post the findings of our webex session last night -
>>>> didn't want to steal the thunder :-)
>>>>
>>>> Personally, I think we've uncovered a great many little things that
>>>> conspired to cause this :-)
>>>>
>>>> --
>>>> Marko Milivojevic - CCIE #18427 (SP R&S)
>>>> Senior CCIE Instructor - IPexpert
>>>>
>>>> :: This message was sent from a mobile device. I apologize for errors
>>>> and brevity. ::
>>>>
>>>> On Dec 1, 2012, at 7:17, John Neiberger <jneiberger_at_gmail.com> wrote:
>>>>
>>>> > Yes, that is exactly what happened! I had never run into that
>>>> > behavior. The root cause of this weirdness for us was a linecard that
>>>> > had locked up. It was no longer passing traffic, but it had not yet
>>>> > reloaded, so the upstream neighbor timed out OSPF in 4 seconds but BGP
>>>> > stayed up for 180 seconds. During that time period, traffic for some
>>>> > destinations was blackholed because of the presence of a supernet
>>>> > route that was being used to validate the reachability of the next hop
>>>> > in BGP.
>>>> >
>>>> > I had never heard of Next Hop Tracking until yesterday when someone
>>>> > from Cisco said that a similar problem was solved in IOS XR using this
>>>> > feature, but I don't recall if they mentioned that this feature was
>>>> > available in IOS. Our production routers run XR and my home lab is
>>>> > IOS. I'll lab it up using NHT and use that as a proof-of-concept to
>>>> > the other engineers involved in this.
>>>> >
>>>> > Thanks go to you and to Marko for helping me through this. We were
>>>> > pretty confused by it, partially because none of us observed the
>>>> > problem while it was happening. By the time network engineering was
>>>> > involved, the problem had resolved. We just had to figure it out
>>>> > step-by-step in reverse, which took a while.
>>>> >
>>>> > Thanks again to both of you!
>>>> > John
>>>> >
>>>> >
>>>> > On Sat, Dec 1, 2012 at 12:13 AM, Brian Dennis <bdennis_at_ine.com> wrote:
>>>> >
>>>> >> John,
>>>> >> Let me see if I can sum this up:
>>>> >>
>>>> >> Two iBGP peers (let's call them Peer 1 and Peer 2) are advertising
>>>> >> the same prefix via iBGP to Peer 3. From Peer 1 the NH (Next-Hop)
>>>> >> metric is 10 to reach the prefix and from Peer 2 the NH metric is 20.
>>>> >> The BGP decision process is selecting Peer 1 over Peer 2 due to the
>>>> >> lower NH metric. You also have a default route learned via iBGP from
>>>> >> another peer and let's say it has a NH metric of 5.
>>>> >>
>>>> >> Now Peer 2 goes down and within about 35 to 45 seconds the IGP
>>>> >> converges and the NH is removed from Peer 3's RIB. Okay fine, so what,
>>>> >> you might think, as you're not using Peer 2 to reach that prefix
>>>> >> anyway. The BGP peering session is still up because the default BGP
>>>> >> keepalive/hold timers are 60/180 seconds in IOS. So time-wise we are
>>>> >> about 60 seconds after the failure of Peer 2.
>>>> >>
>>>> >> After the default delay timer of 5 seconds, NHT (Next-Hop Tracking)
>>>> >> in BGP on Peer 3 kicks in (or the BGP scan process in older IOS
>>>> >> versions) to see if the NH to Peer 2 is still reachable via another
>>>> >> route in the RIB. The only route available to reach the NH advertised
>>>> >> by Peer 2 is the default route. The default route's NH metric is less
>>>> >> than the NH metric to Peer 1, which is the current best path. This
>>>> >> means that BGP updates the NH metric to Peer 2's NH with the IGP
>>>> >> metric to reach the default route (5). Now the prefix advertised by
>>>> >> Peer 2 becomes the best path since it now has a lower NH metric. At
>>>> >> this point we are about 66 seconds after the failure of Peer 2, but we
>>>> >> still have 114 more seconds until BGP detects that Peer 2 is actually
>>>> >> down, so Peer 2's prefix is still usable by BGP. The real problem, as
>>>> >> we know, is that Peer 2 is actually down but Peer 3 will not detect it
>>>> >> until the BGP hold time finally expires (180 seconds). Once Peer 2 is
>>>> >> declared down, Peer 3 will switch back to Peer 1's prefix. The problem
>>>> >> is that traffic is at best sub-optimally routed or at worst "black
>>>> >> holed" when this occurs, as Peer 3 isn't using the "best" path (Peer
>>>> >> 1).
>>>> >>
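The flip described above can be reproduced with a small Python sketch. It is a simplification under stated assumptions: all other BGP attributes tie, so only the metric of the route that resolves the next hop decides, and the RIB is modeled as a plain prefix-to-metric dictionary. The addresses and metrics are borrowed from the lab output in this thread (10.3.3.3 as Peer 1, 10.5.5.5 as Peer 2, default route metric 2); the function names are hypothetical.

```python
import ipaddress

def resolve_next_hop(rib, next_hop):
    """Return (network, metric) of the longest-prefix match for next_hop, or None."""
    addr = ipaddress.ip_address(next_hop)
    candidates = []
    for prefix, metric in rib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net:
            candidates.append((net, metric))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0].prefixlen)

def best_path(rib, paths):
    """Choose the peer whose next hop resolves with the lowest metric.

    Simplification: every other BGP attribute is assumed to tie, so only
    the metric to the next hop decides.
    """
    usable = []
    for peer, next_hop in paths.items():
        hit = resolve_next_hop(rib, next_hop)
        if hit is not None:          # an unresolvable next hop makes the path unusable
            usable.append((hit[1], peer))
    return min(usable)[1] if usable else None

paths = {"Peer 1": "10.3.3.3", "Peer 2": "10.5.5.5"}

# Before the failure: both loopbacks are known as /32 routes via the IGP.
rib_before = {"10.3.3.3/32": 10, "10.5.5.5/32": 20, "0.0.0.0/0": 2}
# After IGP convergence: Peer 2's /32 is gone, only the default still covers it.
rib_after = {"10.3.3.3/32": 10, "0.0.0.0/0": 2}

print(best_path(rib_before, paths))  # Peer 1
print(best_path(rib_after, paths))   # Peer 2
```

Once the /32 for Peer 2 disappears, its next hop inherits the default route's lower metric and wins the comparison even though Peer 2 itself is down.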
>>>> >> This is actually a common problem with a simple solution. Using BGP
>>>> >> Selective NHT, just don't allow BGP to use any route with a /31 or
>>>> >> shorter mask (or whatever length you want) and/or another BGP route to
>>>> >> reach the next hop.
>>>> >>
>>>> >>
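>>>> >> router bgp 10
>>>> >>  bgp nexthop route-map RM_NH_FILTER
>>>> >> !
>>>> >>
>>>> >> ! Matches every route with a mask of /31 or shorter, including 0.0.0.0/0
>>>> >> ip prefix-list PL_NH_FILTER seq 5 permit 0.0.0.0/0 le 31
>>>> >> !
>>>> >> ! Do not resolve next hops through anything matched above
>>>> >> route-map RM_NH_FILTER deny 10
>>>> >>  match ip address prefix-list PL_NH_FILTER
>>>> >> !
>>>> >> ! Do not resolve next hops through BGP-learned routes
>>>> >> route-map RM_NH_FILTER deny 20
>>>> >>  match source-protocol bgp 10
>>>> >> !
>>>> >> ! Any remaining route may resolve a next hop
>>>> >> route-map RM_NH_FILTER permit 30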
>>>> >>
>>>> >> With this configuration Peer 3 will show the NH to Peer 2's prefix as
>>>> >> "inaccessible", as it cannot use any route with a mask shorter than a
>>>> >> /32 and/or a route that's installed in the RIB via BGP. This can also
>>>> >> be used to not allow BGP to use a discard route for the NH, which is
>>>> >> another common problem.
>>>> >>
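The effect of the three route-map clauses can be sketched in Python as a plain decision function. This is only a model of the filter logic in the configuration above, assuming IPv4 routes and a simple protocol label; `nh_route_allowed` is a hypothetical name, not an IOS or library API.

```python
import ipaddress

def nh_route_allowed(prefix, source_protocol):
    """Return True if a RIB route may be used to resolve a BGP next hop."""
    net = ipaddress.ip_network(prefix)
    # deny 10: the prefix-list "permit 0.0.0.0/0 le 31" matches masks /0 to /31
    if net.prefixlen <= 31:
        return False
    # deny 20: routes installed in the RIB by BGP
    if source_protocol == "bgp":
        return False
    # permit 30: anything left, i.e. a non-BGP /32 route
    return True

print(nh_route_allowed("0.0.0.0/0", "bgp"))     # False: the default route is rejected
print(nh_route_allowed("10.5.5.5/32", "ospf"))  # True: IGP host route to the loopback
print(nh_route_allowed("10.0.0.0/8", "ospf"))   # False: too short, even though IGP-learned
```

With the default route rejected as a resolver, a next hop whose /32 has vanished has nothing left to resolve through and is marked inaccessible.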
>>>> >>
>>>> >> I'll just do a quick blog post on this tomorrow. I labbed it all up
>>>> >> already to verify what I talked about above, but the post will have to
>>>> >> wait until tomorrow. Wife is telling me it's 11pm and time to get off
>>>> >> the computer as it's Friday night ;-) There are a few minor caveats to
>>>> >> this that I'll mention this weekend in the blog post. Also I pasted my
>>>> >> tests below.
>>>> >>
>>>> >> --
>>>> >> Brian Dennis, CCIEx5 #2210 (R&S/ISP-Dial/Security/SP/Voice)
>>>> >> bdennis_at_ine.com
>>>> >>
>>>> >> INE, Inc.
>>>> >> http://www.INE.com
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> ******************************************************
>>>> >> R6 is our Peer 3 and R3 and R5 are Peer 1 and 2
>>>> >> ******************************************************
>>>> >>
>>>> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>>>> >> BGP routing table entry for 50.0.0.0/8, version 3
>>>> >> Paths: (2 available, best #2, table Default-IP-Routing-Table)
>>>> >> Advertised to update-groups:
>>>> >> 2
>>>> >> 200, (Received from a RR-client)
>>>> >> 10.5.5.5 (metric 20) from 10.5.5.5 (10.5.5.5)
>>>> >> Origin incomplete, metric 0, localpref 100, valid, internal
>>>> >> 200, (Received from a RR-client)
>>>> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>>>> >> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>>>> >> BGP routing table entry for 0.0.0.0/0, version 2
>>>> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>>>> >> Advertised to update-groups:
>>>> >> 2
>>>> >> Local, (Received from a RR-client)
>>>> >> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>>>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>>>> >> Rack1R6#show ip route 10.3.3.3
>>>> >> Routing entry for 10.3.3.3/32
>>>> >> Known via "ospf 1", distance 110, metric 10, type intra area
>>>> >> Last update from 173.1.36.3 on GigabitEthernet0/0.36, 00:17:05 ago
>>>> >> Routing Descriptor Blocks:
>>>> >> * 173.1.36.3, from 10.3.3.3, 00:17:05 ago, via GigabitEthernet0/0.36
>>>> >> Route metric is 10, traffic share count is 1
>>>> >>
>>>> >> Rack1R6#show ip route 10.5.5.5
>>>> >> Routing entry for 10.5.5.5/32
>>>> >> Known via "ospf 1", distance 110, metric 20, type intra area
>>>> >> Last update from 173.1.0.205 on GigabitEthernet0/0.456, 00:00:50 ago
>>>> >> Routing Descriptor Blocks:
>>>> >> * 173.1.0.205, from 10.5.5.5, 00:00:50 ago, via GigabitEthernet0/0.456
>>>> >> Route metric is 20, traffic share count is 1
>>>> >>
>>>> >> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>>>> >> BGP routing table entry for 0.0.0.0/0, version 2
>>>> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>>>> >> Advertised to update-groups:
>>>> >> 2
>>>> >> Local, (Received from a RR-client)
>>>> >> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>>>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>>>> >> Rack1R6#
>>>> >>
>>>> >> ******************************************************
>>>> >> Now let's shut down the interface R5 uses to reach R6.
>>>> >> ******************************************************
>>>> >>
>>>> >> Rack1R5#conf t
>>>> >> Enter configuration commands, one per line. End with CNTL/Z.
>>>> >> Rack1R5(config)#int fa0/0.456
>>>> >> Rack1R5(config-subif)#shut
>>>> >> Rack1R5(config-subif)#^Z
>>>> >> Rack1R5#
>>>> >> %OSPF-5-ADJCHG: Process 1, Nbr 10.6.6.6 on FastEthernet0/0.456 from FULL to DOWN, Neighbor Down: Interface down or detached
>>>> >> %OSPF-5-ADJCHG: Process 1, Nbr 10.10.10.10 on FastEthernet0/0.456 from FULL to DOWN, Neighbor Down: Interface down or detached
>>>> >> %SYS-5-CONFIG_I: Configured from console by console
>>>> >> Rack1R5#
>>>> >>
>>>> >> ******************************************************
>>>> >> Now wait for OSPF to converge
>>>> >> ******************************************************
>>>> >>
>>>> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>>>> >> BGP routing table entry for 50.0.0.0/8, version 4
>>>> >> Paths: (2 available, best #1, table Default-IP-Routing-Table)
>>>> >> Flag: 0x900
>>>> >> Advertised to update-groups:
>>>> >> 2
>>>> >> 200, (Received from a RR-client)
>>>> >> 10.5.5.5 (metric 2) from 10.5.5.5 (10.5.5.5)
>>>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>>>> >> 200, (Received from a RR-client)
>>>> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>>> >> Origin incomplete, metric 0, localpref 100, valid, internal
>>>> >> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>>>> >> BGP routing table entry for 0.0.0.0/0, version 2
>>>> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>>>> >> Advertised to update-groups:
>>>> >> 2
>>>> >> Local, (Received from a RR-client)
>>>> >> 10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>>>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>>>> >> Rack1R6#
>>>> >>
>>>> >> ******************************************************
>>>> >> Now roughly 180 seconds later
>>>> >> ******************************************************
>>>> >>
>>>> >> Rack1R6#
>>>> >> %BGP-3-NOTIFICATION: received from neighbor 10.5.5.5 4/0 (hold time expired) 0 bytes
>>>> >> Rack1R6#
>>>> >> %BGP-5-ADJCHANGE: neighbor 10.5.5.5 Down BGP Notification received
>>>> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>>>> >> BGP routing table entry for 50.0.0.0/8, version 5
>>>> >> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>>>> >> Flag: 0x900
>>>> >> Advertised to update-groups:
>>>> >> 2
>>>> >> 200, (Received from a RR-client)
>>>> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>>>> >> Rack1R6#
>>>> >>
>>>> >> ******************************************************
>>>> >> But with the Selective NHT config this is the result
>>>> >> ******************************************************
>>>> >>
>>>> >> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>>>> >> BGP routing table entry for 50.0.0.0/8, version 5
>>>> >> Paths: (2 available, best #2, table Default-IP-Routing-Table)
>>>> >> Advertised to update-groups:
>>>> >> 2
>>>> >> 200, (Received from a RR-client)
>>>> >> 10.5.5.5 (inaccessible) from 10.5.5.5 (10.5.5.5)
>>>> >> Origin incomplete, metric 0, localpref 100, valid, internal
>>>> >> 200, (Received from a RR-client)
>>>> >> 10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>>> >> Origin incomplete, metric 0, localpref 100, valid, internal, best
>>>> >> Rack1R6#
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On 11/30/12 6:21 PM, "John Neiberger" <jneiberger_at_gmail.com> wrote:
>>>> >>
>>>> >>> I posted this question to the Cisco NSP list and I've also talked to a
>>>> >>> couple of guys from Cisco Advanced Services and I'm still stumped
>>>> >>> about something. I'll try my best to phrase it in a way that makes
>>>> >>> sense.
>>>> >>>
>>>> >>> Router A is learning about a prefix from two route reflector clients.
>>>> >>> In both cases, the next hop for the prefix is the loopback address of
>>>> >>> the advertising router. Their loopback addresses are being advertised
>>>> >>> into OSPF.
>>>> >>>
>>>> >>> So, from the perspective of Router A, its BGP table for this prefix
>>>> >>> has two paths:
>>>> >>>
>>>> >>> 1. 4.4.4.4 (loopback address of Router B, learned via OSPF) * winner
>>>> >>>    due to lower IGP metric
>>>> >>> 2. 5.5.5.5 (loopback address of Router C, learned via OSPF)
>>>> >>>
>>>> >>> Now for the weirdness to begin. A network event occurs that causes the
>>>> >>> loopback address of Router C to go away. This shouldn't affect Router
>>>> >>> A because it is already selecting the shortest path to the network via
>>>> >>> Router B (4.4.4.4).
>>>> >>>
>>>> >>> However, Router A is also learning a default via BGP. That means that
>>>> >>> even though 5.5.5.5 (loopback of Router C) disappeared and is
>>>> >>> unreachable, the router is doing a recursive lookup and keeps the path
>>>> >>> in the BGP table; 5.5.5.5 is still reachable, it thinks, by using the
>>>> >>> default route.
>>>> >>>
>>>> >>> The weird thing is that this causes Router A to start using the wrong
>>>> >>> path! It seems to be preferring a path with a next hop learned via BGP
>>>> >>> to a path with a next hop learned via OSPF. Why would it do this? I
>>>> >>> see no documentation that would explain why a BGP-learned next hop is
>>>> >>> preferred over an IGP-learned next hop.
>>>> >>>
>>>> >>> Is the router still comparing IGP metrics even though the "wrong" path
>>>> >>> now has no IGP metric?
>>>> >>>
>>>> >>> It's not changing due to router ID, cluster length, or neighbor IP
>>>> >>> address. I checked. So, why is it switching?
>>>> >>>
>>>> >>> As soon as the BGP session from Router A to Router C times out, the
>>>> >>> extraneous path gets removed from the BGP table and the router goes
>>>> >>> back to using the correct path it should have been using all along.
>>>> >>>
>>>> >>> So, is a BGP-learned next hop preferred over an IGP-learned next hop?
>>>> >>> If so, why? If not, any idea why my router switches paths? I've turned
>>>> >>> on BGP debugging and IP routing debugging and haven't found a suitable
>>>> >>> explanation for the switch.
>>>> >>>
>>>> >>> John
>>>> >>>
>>>> >>>
>>>> >>> Blogs and organic groups at http://www.ccie.net
>>>> >>>
>>>> >>>
>>>> >>> _______________________________________________________________________
>>>> >>> Subscription information may be found at:
>>>> >>> http://www.groupstudy.com/list/CCIELab.html

Blogs and organic groups at http://www.ccie.net
Received on Sat Dec 01 2012 - 13:57:29 ART

This archive was generated by hypermail 2.2.0 : Tue Jan 01 2013 - 09:36:53 ART