I was hoping you'd post the findings of our WebEx session last night - didn't want to steal your thunder :-)
Personally, I think we've uncovered a great many little things that conspired to cause this :-)
--
Marko Milivojevic - CCIE #18427 (SP R&S)
Senior CCIE Instructor - IPexpert

:: This message was sent from a mobile device. I apologize for errors and brevity. ::

On Dec 1, 2012, at 7:17, John Neiberger <jneiberger_at_gmail.com> wrote:

> Yes, that is exactly what happened! I had never run into that behavior. The
> root cause of this weirdness for us was a linecard that had locked up. It
> was no longer passing traffic, but it had not yet reloaded, so the upstream
> neighbor timed out OSPF in 4 seconds but BGP stayed up for 180 seconds.
> During that time period, traffic for some destinations was blackholed
> because of the presence of a supernet route that was being used to validate
> the reachability of the next hop in BGP.
>
> I had never heard of Next Hop Tracking until yesterday when someone from
> Cisco said that a similar problem was solved in IOS XR using this feature,
> but I don't recall if they mentioned that this feature was available in
> IOS. Our production routers run XR and my home lab is IOS. I'll lab it up
> using NHT and use that as a proof-of-concept to the other engineers
> involved in this.
>
> Thanks go to you and to Marko for helping me through this. We were pretty
> confused by it, partially because none of us observed the problem while it
> was happening. By the time network engineering was involved, the problem
> had resolved. We just had to figure it out step-by-step in reverse, which
> took a while.
>
> Thanks again to both of you!
> John
>
>
> On Sat, Dec 1, 2012 at 12:13 AM, Brian Dennis <bdennis_at_ine.com> wrote:
>
>> John,
>> Let me see if I can sum this up:
>>
>> Two iBGP peers (let's call them Peer 1 and Peer 2) are advertising the
>> same prefix via iBGP to Peer 3. From Peer 1 the NH (Next-Hop) metric is
>> 10 to reach the prefix and from Peer 2 the NH metric is 20. The BGP
>> decision process is selecting Peer 1 over Peer 2 due to the lower NH
>> metric.
>> You also have a default route learned via iBGP from another peer
>> and let's say it has a NH metric of 5.
>>
>> Now Peer 2 goes down and within about 35 to 45 seconds the IGP converges
>> and the NH is removed from Peer 3's RIB. Okay fine, so what, you might
>> think, as you're not using Peer 2 to reach that prefix anyway. The BGP
>> peering session is still up due to the default BGP keepalive/hold timers
>> being 60/180 seconds in IOS. So timewise we are about 60 seconds after
>> the failure of Peer 2.
>>
>> After the default delay timer of 5 seconds the NHT (Next-Hop-Tracking) in
>> BGP on Peer 3 kicks in (or the BGP scan process in older IOS versions) to
>> look and see if the NH to Peer 2 is still reachable via another route in
>> the RIB. The only route available to reach the NH advertised by Peer 2 is
>> the default route. The default route's NH metric is less than the NH
>> metric to Peer 1, which is the current best path. This means that BGP
>> updates the NH metric to Peer 2's NH with the IGP metric to reach the
>> default route (5). Now the prefix advertised by Peer 2 becomes the best
>> path since it now has a lower NH metric. At this point we are about 66
>> seconds after the failure of Peer 2 but we still have 114 more seconds
>> until BGP detects that Peer 2 is actually down, so Peer 2's prefix is
>> still usable by BGP. The real problem, as we know, is that Peer 2 is
>> actually down but Peer 3 will not detect it until the BGP hold time
>> finally expires (180 seconds). Once Peer 2 is declared down, Peer 3 will
>> switch back to Peer 1's prefix. The problem is that traffic is at best
>> sub-optimally routed or at worst "black holed" when this occurs, as Peer 3
>> isn't using the "best" path (Peer 1).
>>
>> This is actually a common problem with a simple solution. Using BGP
>> Selective NHT, just don't allow BGP to resolve the next hop through any
>> route with a /31 or shorter mask (or whatever length you want) and/or
>> through another BGP route.
>>
>>
>> router bgp 10
>>  bgp nexthop route-map RM_NH_FILTER
>> !
>>
>> ip prefix-list PL_NH_FILTER seq 5 permit 0.0.0.0/0 le 31
>> !
>> route-map RM_NH_FILTER deny 10
>>  match ip address prefix-list PL_NH_FILTER
>> !
>> route-map RM_NH_FILTER deny 20
>>  match source-protocol bgp 10
>> !
>> route-map RM_NH_FILTER permit 30
>>
>>
>> With this configuration Peer 3 will show the NH to Peer 2's prefix as
>> "inaccessible" as it cannot use any route with a mask shorter than a /32
>> and/or a route that's installed in the RIB via BGP. This can also be used
>> to not allow BGP to use a discard route for the NH, which is another
>> common problem.
>>
>>
>> I'll just do a quick blog post on this tomorrow. I labbed it all up
>> already to verify what I talked about above but the post will have to wait
>> until tomorrow. Wife is telling me it's 11pm and time to get off the
>> computer as it's Friday night ;-) There are a few minor caveats to this
>> that I'll mention this weekend in the blog post. Also I pasted my tests
>> below.
>>
>> --
>> Brian Dennis, CCIEx5 #2210 (R&S/ISP-Dial/Security/SP/Voice)
>> bdennis_at_ine.com
>>
>> INE, Inc.
>> http://www.INE.com
>>
>>
>> ******************************************************
>> R6 is our Peer 3 and R3 and R5 are Peer 1 and 2
>> ******************************************************
>>
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 3
>> Paths: (2 available, best #2, table Default-IP-Routing-Table)
>>   Advertised to update-groups:
>>      2
>>   200, (Received from a RR-client)
>>     10.5.5.5 (metric 20) from 10.5.5.5 (10.5.5.5)
>>       Origin incomplete, metric 0, localpref 100, valid, internal
>>   200, (Received from a RR-client)
>>     10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>       Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> BGP routing table entry for 0.0.0.0/0, version 2
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>>   Advertised to update-groups:
>>      2
>>   Local, (Received from a RR-client)
>>     10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>>       Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#show ip route 10.3.3.3
>> Routing entry for 10.3.3.3/32
>>   Known via "ospf 1", distance 110, metric 10, type intra area
>>   Last update from 173.1.36.3 on GigabitEthernet0/0.36, 00:17:05 ago
>>   Routing Descriptor Blocks:
>>   * 173.1.36.3, from 10.3.3.3, 00:17:05 ago, via GigabitEthernet0/0.36
>>       Route metric is 10, traffic share count is 1
>>
>> Rack1R6#show ip route 10.5.5.5
>> Routing entry for 10.5.5.5/32
>>   Known via "ospf 1", distance 110, metric 20, type intra area
>>   Last update from 173.1.0.205 on GigabitEthernet0/0.456, 00:00:50 ago
>>   Routing Descriptor Blocks:
>>   * 173.1.0.205, from 10.5.5.5, 00:00:50 ago, via GigabitEthernet0/0.456
>>       Route metric is 20, traffic share count is 1
>>
>> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> BGP routing table entry for 0.0.0.0/0, version 2
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>>   Advertised to update-groups:
>>      2
>>   Local, (Received from a RR-client)
>>     10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>>       Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
>>
>> ******************************************************
>> Now let's shut down the interface R5 uses to reach R6.
>> ******************************************************
>>
>> Rack1R5#conf t
>> Enter configuration commands, one per line. End with CNTL/Z.
>> Rack1R5(config)#int fa0/0.456
>> Rack1R5(config-subif)#shut
>> Rack1R5(config-subif)#^Z
>> Rack1R5#
>> %OSPF-5-ADJCHG: Process 1, Nbr 10.6.6.6 on FastEthernet0/0.456 from FULL to DOWN, Neighbor Down: Interface down or detached
>> %OSPF-5-ADJCHG: Process 1, Nbr 10.10.10.10 on FastEthernet0/0.456 from FULL to DOWN, Neighbor Down: Interface down or detached
>> %SYS-5-CONFIG_I: Configured from console by console
>> Rack1R5#
>>
>> ******************************************************
>> Now wait for OSPF to converge
>> ******************************************************
>>
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 4
>> Paths: (2 available, best #1, table Default-IP-Routing-Table)
>> Flag: 0x900
>>   Advertised to update-groups:
>>      2
>>   200, (Received from a RR-client)
>>     10.5.5.5 (metric 2) from 10.5.5.5 (10.5.5.5)
>>       Origin incomplete, metric 0, localpref 100, valid, internal, best
>>   200, (Received from a RR-client)
>>     10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>       Origin incomplete, metric 0, localpref 100, valid, internal
>> Rack1R6#show bgp ipv4 unicast 0.0.0.0/0
>> BGP routing table entry for 0.0.0.0/0, version 2
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>>   Advertised to update-groups:
>>      2
>>   Local, (Received from a RR-client)
>>     10.10.10.10 (metric 2) from 10.10.10.10 (10.10.10.10)
>>       Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
>>
>> ******************************************************
>> Now roughly 180 seconds later
>> ******************************************************
>>
>> Rack1R6#
>> %BGP-3-NOTIFICATION: received from neighbor 10.5.5.5 4/0 (hold time expired) 0 bytes
>> Rack1R6#
>> %BGP-5-ADJCHANGE: neighbor 10.5.5.5 Down BGP Notification received
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 5
>> Paths: (1 available, best #1, table Default-IP-Routing-Table)
>> Flag: 0x900
>>   Advertised to update-groups:
>>      2
>>   200, (Received from a RR-client)
>>     10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>       Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
>>
>> ******************************************************
>> But with the Selective NHT config this is the result
>> ******************************************************
>>
>> Rack1R6#show bgp ipv4 unicast 50.0.0.0/8
>> BGP routing table entry for 50.0.0.0/8, version 5
>> Paths: (2 available, best #2, table Default-IP-Routing-Table)
>>   Advertised to update-groups:
>>      2
>>   200, (Received from a RR-client)
>>     10.5.5.5 (inaccessible) from 10.5.5.5 (10.5.5.5)
>>       Origin incomplete, metric 0, localpref 100, valid, internal
>>   200, (Received from a RR-client)
>>     10.3.3.3 (metric 10) from 10.3.3.3 (10.3.3.3)
>>       Origin incomplete, metric 0, localpref 100, valid, internal, best
>> Rack1R6#
>>
>>
>> On 11/30/12 6:21 PM, "John Neiberger" <jneiberger_at_gmail.com> wrote:
>>
>>> I posted this question to the Cisco NSP list and I've also talked to a
>>> couple of guys from Cisco Advanced Services and I'm still stumped about
>>> something. I'll try my best to phrase it in a way that makes sense.
>>>
>>> Router A is learning about a prefix from two route reflector clients. In
>>> both cases, the next hop for the prefix is the loopback address of the
>>> advertising routers. Their loopback addresses are being advertised into
>>> OSPF.
>>>
>>> So, from the perspective of Router A, its BGP table for this prefix has
>>> two paths:
>>>
>>> 1. 4.4.4.4 (loopback address of Router B, learned via OSPF) * winner due
>>>    to lower IGP metric
>>> 2. 5.5.5.5 (loopback address of Router C, learned via OSPF)
>>>
>>> Now for the weirdness to begin. A network event occurs that causes the
>>> loopback address of Router C to go away. This shouldn't affect Router A
>>> because it is already selecting the shortest path to the network via
>>> Router B (4.4.4.4).
>>>
>>> However, Router A is also learning a default via BGP. That means that even
>>> though 5.5.5.5 (loopback of Router C) disappeared and is unreachable, the
>>> router is doing a recursive lookup and keeps the path in the BGP table;
>>> 5.5.5.5 is still reachable, it thinks, by using the default route.
>>>
>>> The weird thing is that this causes Router A to start using the wrong
>>> path! It seems to be preferring a path with a next hop learned via BGP to
>>> a path with a next hop learned via OSPF. Why would it do this? I see no
>>> documentation that would explain why a BGP-learned next hop is preferred
>>> over an IGP-learned next hop.
>>>
>>> Is the router still comparing IGP metrics even though the "wrong" path now
>>> has no IGP metric?
>>>
>>> It's not changing due to router ID, cluster length, or neighbor IP
>>> address. I checked. So, why is it switching?
>>>
>>> As soon as the BGP session from Router A to Router C times out, the
>>> extraneous path gets removed from the BGP table and the router goes back
>>> to using the correct path it should have been using all along.
>>>
>>> So, is a BGP-learned next hop preferred over an IGP-learned next hop? If
>>> so, why? If not, any idea why my router switches paths? I've turned on BGP
>>> debugging and IP routing debugging and haven't found a suitable
>>> explanation for the switch.
>>>
>>> John
>>>
>>>
>>> Blogs and organic groups at http://www.ccie.net
>>>
>>> _______________________________________________________________________
>>> Subscription information may be found at:
>>> http://www.groupstudy.com/list/CCIELab.html

Received on Sat Dec 01 2012 - 08:41:34 ART
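
For anyone who wants to see the path flip from this thread without a lab, here is a minimal Python sketch of the behavior Brian describes. It is a toy model, not how IOS actually implements next-hop tracking. The names `Route`, `resolve_next_hop` and `best_path` are made up for illustration, the metrics (10, 20, 2) are taken from the R6 lab output, and it assumes every other best-path attribute ties so that only the metric to the next hop decides. The `selective_nht` flag approximates the effect of RM_NH_FILTER: the covering route must be a /32 and must not be BGP-learned.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    """One RIB entry (hypothetical structure, for illustration only)."""
    prefix: str    # e.g. "10.5.5.5/32"
    protocol: str  # "ospf" or "bgp"
    metric: int    # metric BGP would record for a next hop resolved via this route


def resolve_next_hop(next_hop: str, rib: List[Route],
                     selective_nht: bool = False) -> Optional[Route]:
    """Return the longest-match route covering next_hop, or None if inaccessible."""
    addr = ip_address(next_hop)
    covering = [r for r in rib if addr in ip_network(r.prefix)]
    if not covering:
        return None
    match = max(covering, key=lambda r: ip_network(r.prefix).prefixlen)
    if selective_nht:
        # Approximates RM_NH_FILTER: deny masks of /31 or shorter,
        # deny routes installed by BGP, permit everything else.
        if ip_network(match.prefix).prefixlen <= 31 or match.protocol == "bgp":
            return None
    return match


def best_path(next_hops: List[str], rib: List[Route],
              selective_nht: bool = False) -> Optional[str]:
    """Pick the next hop with the lowest metric (all other attributes assumed equal)."""
    usable = []
    for nh in next_hops:
        route = resolve_next_hop(nh, rib, selective_nht)
        if route is not None:
            usable.append((route.metric, nh))
    return min(usable)[1] if usable else None


# RIB on R6 (Peer 3) before the failure, metrics from the lab output.
rib_before = [
    Route("10.3.3.3/32", "ospf", 10),  # Peer 1 loopback
    Route("10.5.5.5/32", "ospf", 20),  # Peer 2 loopback
    Route("0.0.0.0/0", "bgp", 2),      # default learned via iBGP
]
# After OSPF converges, Peer 2's /32 is gone but its BGP session is still up.
rib_after = [r for r in rib_before if r.prefix != "10.5.5.5/32"]

peers = ["10.3.3.3", "10.5.5.5"]
print(best_path(peers, rib_before))                     # 10.3.3.3
print(best_path(peers, rib_after))                      # 10.5.5.5 (resolved via default)
print(best_path(peers, rib_after, selective_nht=True))  # 10.3.3.3
```

The second result is the misbehavior from the thread: once the /32 disappears, 10.5.5.5 resolves through the default route and inherits its lower metric. With the filter on, that resolution is rejected, the path becomes inaccessible, and 10.3.3.3 stays best.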
This archive was generated by hypermail 2.2.0 : Tue Jan 01 2013 - 09:36:53 ART