RE: EEM to keep BGP peer shut during an interface flap

From: Brian McGahan <bmcgahan_at_ine.com>
Date: Thu, 29 Aug 2013 07:49:51 -0500

I actually do something similar on our edge routers: ping the anycast addresses of the root DNS servers and, if they are unreachable, reroute around the failure. What Jon said is true, that "pinging things outside of your administrative control makes the behavior of your network beholden to an external entity's security policies". There is a simple fix for this though: don't use ping, use a DNS query. If the root anycast DNS servers don't respond to queries, the Internet is truly down, i.e. there has been global thermonuclear war :)

Joe, one change you might consider in your policy though: 8.8.8.8 and 8.8.4.4 aren't root servers, they're Google's servers. However unlikely it is that Google's servers will go down, it is possible (http://news.sky.com/story/1129847/google-outage-internet-traffic-plunges-40-percent). You'd be better off pointing at the top level anycast addresses, which you can find here: http://www.root-servers.org/

For example, the "J" server 192.58.128.30 hosted by Verisign has 70 instances; the "L" server 199.7.83.42 hosted by ICANN has 146 instances. If your SLA/EEM policy says, in pseudocode, IF 192.58.128.30 == DOWN && 199.7.83.42 == DOWN THEN REROUTE = TRUE, you're basically saying that your upstream provider has lost connectivity to 200+ instances of top level DNS, i.e. their network is broken.
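
A minimal sketch of that pseudocode using IP SLA and a tracked list, offered as an untested illustration rather than a production config. The operation numbers (1, 2), track numbers (1, 2, 10), the 30-second frequency and the queried name www.example.com are all assumptions for illustration. Whether a referral-only reply from a root server counts as success for the IP SLA dns operation is something to verify on your platform. Older releases use "track 1 rtr 1 reachability" instead of "track 1 ip sla 1 reachability".

```
! Probe 1: DNS query sent to the "J" root server
ip sla 1
 dns www.example.com name-server 192.58.128.30
 frequency 30
ip sla schedule 1 life forever start-time now
!
! Probe 2: DNS query sent to the "L" root server
ip sla 2
 dns www.example.com name-server 199.7.83.42
 frequency 30
ip sla schedule 2 life forever start-time now
!
! One tracked object per probe
track 1 ip sla 1 reachability
track 2 ip sla 2 reachability
!
! Boolean OR list: up while either probe succeeds,
! down only when BOTH probes fail (the REROUTE condition)
track 10 list boolean or
 object 1
 object 2
```

Tracking the list rather than the individual probes is what gives the "both down" behaviour: a single unreachable root server never changes the state of object 10.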

So in the end, if your goal is application reachability, why would you not use both tools, like BFD plus EEM? Like Joe, I've learned this lesson the hard way in the past. Just because your BGP peering to your upstream neighbor is working doesn't mean that they don't have upstream routing issues, or that their upstream peers don't have issues.
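
For the BFD half of that pairing, a rough sketch, assuming the provider agrees to run BFD on the peering link. The interface name and the 300 ms timers are assumptions for illustration; the AS number and neighbor address are borrowed from Jon's script further down the thread.

```
! First-hop failure detection on the link facing the provider
interface GigabitEthernet0/0
 bfd interval 300 min_rx 300 multiplier 3
!
! Tear the BGP session down as soon as BFD declares the neighbor dead
router bgp 2
 neighbor 10.10.23.3 fall-over bfd
```

This only covers the first hop; problems further upstream still need something like the IP SLA/EEM approach.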

Also whoever was originally asking the question about the script:

>>> event manager applet CHECK-PING-STATUS
>>> event none
>>> action 11.1 cli command "ping 2.2.2.2"
>>> action 11.2 regexp "(.*) (!\!\!\!\!) (.*)" "$_cli_result" _match _sub1
>>> action 11.3 if $_regexp_result eq 1
>>> action 11.4 syslog msg "Ping is success"
>>> action 11.5 else
>>> action 11.6 syslog msg "Ping is failed"
>>> action 11.7 end
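
If all you want is for the regexp itself to match, here is a minimal sketch (untested, an assumption about the likely cause rather than a confirmed diagnosis). In the ping output the "!!!!!" sits on a line of its own, so a pattern that demands a literal space on either side of it has nothing to match; matching the bare string avoids that. The added "enable" action is an assumption that the applet's CLI session starts in user EXEC mode.

```
event manager applet CHECK-PING-STATUS
 event none
 action 11.0 cli command "enable"
 action 11.1 cli command "ping 2.2.2.2"
 action 11.2 regexp "!!!!!" "$_cli_result"
 action 11.3 if $_regexp_result eq 1
 action 11.4  syslog msg "Ping is success"
 action 11.5 else
 action 11.6  syslog msg "Ping is failed"
 action 11.7 end
```

Run it by hand with "event manager run CHECK-PING-STATUS". Note that this pattern only reports success when all five replies come back.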

You're going too far out of the way to solve a simpler problem. The router already has this built in with IP SLA and Enhanced Object Tracking. IP SLA can ping/dns query/tcp connect/etc. to your application and report its status to an object. EEM can then poll the object:

http://www.cisco.com/en/US/docs/ios/12_2sr/12_2srb/feature/guide/srbeotem.html

"Enhanced Object Tracking (EOT) is now integrated with Embedded Event Manager (EEM) to allow EEM to report on a status change of a tracked object and to allow enhanced object tracking to track EEM objects."
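
As a rough sketch of that approach applied to the original question (not a tested config): track the interface, let the tracked object absorb the flapping with an up delay, and have EEM react only to the object's state changes. The interface name, track number 20, the 300-second delay and the applet names are assumptions for illustration; the AS number and neighbor address are borrowed from Jon's script.

```
! Object goes down immediately, but only comes back up after the
! interface has stayed up for 300 seconds, so flaps are absorbed here
track 20 interface GigabitEthernet0/1 line-protocol
 delay up 300
!
event manager applet BGP-PEER-SHUT
 event track 20 state down
 action 1.0 cli command "enable"
 action 1.1 cli command "configure terminal"
 action 1.2 cli command "router bgp 2"
 action 1.3 cli command "neighbor 10.10.23.3 shutdown"
 action 1.4 cli command "end"
 action 2.0 syslog msg "Track 20 down: BGP peer 10.10.23.3 shut"
!
event manager applet BGP-PEER-NOSHUT
 event track 20 state up
 action 1.0 cli command "enable"
 action 1.1 cli command "configure terminal"
 action 1.2 cli command "router bgp 2"
 action 1.3 cli command "no neighbor 10.10.23.3 shutdown"
 action 1.4 cli command "end"
 action 2.0 syslog msg "Track 20 up: BGP peer 10.10.23.3 re-enabled"
```

Because the hold-down lives in the tracked object, the applets stay trivial and the peer is only re-enabled once the interface has been stable for the whole delay.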

HTH,

Brian McGahan, 4 x CCIE #8593 (R&S/SP/SC/DC), CCDE #2013::13
bmcgahan_at_INE.com

Internetwork Expert, Inc.
http://www.INE.com

-----Original Message-----
From: nobody_at_groupstudy.com [mailto:nobody_at_groupstudy.com] On Behalf Of Joseph L. Brunner
Sent: Wednesday, August 28, 2013 11:52 PM
To: 'jon.hartman_at_verizon.net'; 'jay.mcmickle_at_yahoo.com'; 'chris.rae07_at_me.com'
Cc: 'mathewfer_at_gmail.com'; 'marco207p_at_gmail.com'; 'jneiberger_at_gmail.com'; 'ccielab_at_groupstudy.com'
Subject: Re: EEM to keep BGP peer shut during an interface flap

Jon,

This morning, 8/28 at 11am EDT, Level3, which is the nation's leading ISP (and carries up to 2/3 of the internet traffic in many "fed reserve bank cities"), was so broken on BGP traffic with Verizon (the nation's largest telephone provider) that internet traffic from a "money center bank" I won't name, which uses Verizon, to our trade gateway on Level3 in Secaucus went from Manhattan, NYC, to Stockholm, SWE, to Secaucus, NJ...

Now, I know you got your CCIE and an "umbra" clearance I will never have. But I'm pretty sure I see more networks on an average day, supporting 500 random companies in 12 countries that call us, than most of the CCIEs in "your unit".

Take my advice - trust no one. Especially not a $6 billion ISP (owned by Warren Buffett?), with BFD, whose backbone really runs end to end (meaning all the way to your customer), using meager first-hop technologies designed for an enterprise routing scenario...

Think BFD would have stopped (or caught) this, while our gigabit internet links were effectively down to our customers coming from other ISPs???

This traceroute is from Manhattan to Secaucus 14 hours ago or so:

C:\>tracert 8.30.204.228

Tracing route to 8.30.204.228 over a maximum of 30 hops

  1     1 ms    <1 ms    <1 ms  192.168.15.22
  2     2 ms     1 ms     1 ms  63.ge-4-0-1.GW14.NYC1.ALTER.NET [152.179.246.205]
  3     1 ms     1 ms     1 ms  0.xe-7-1-0.XL3.NYC1.ALTER.NET [152.63.4.150]
  4     2 ms     2 ms     1 ms  0.xe-5-0-0.BR1.NYC1.ALTER.NET [152.63.16.61]
  5     *        *        *     Request timed out.
  6     *      441 ms   440 ms  vlan51.ebr1.NewYork2.Level3.net [4.69.138.222]
  7   439 ms   440 ms   437 ms  ae-44-44.ebr1.Stockholm2.Level3.net [4.69.201.41]
  8   138 ms   138 ms     *     ae-2-2.ebr1.Newark1.Level3.net [4.69.132.98]
  9   430 ms   436 ms   434 ms  ae-1-51.edge2.Newark1.Level3.net [4.69.156.9]
 10   322 ms   308 ms   296 ms  4.30.130.2
 11   377 ms   369 ms   362 ms  8.30.204.228

Words of wisdom...

----- Original Message -----
From: Jon Hartman [mailto:jon.hartman_at_verizon.net]
Sent: Thursday, August 29, 2013 12:30 AM
To: 'Jay McMickle' <jay.mcmickle_at_yahoo.com>; 'Christopher Rae' <chris.rae07_at_me.com>
Cc: Joseph L. Brunner; mathewfer_at_gmail.com <mathewfer_at_gmail.com>; marco207p_at_gmail.com <marco207p_at_gmail.com>; jneiberger_at_gmail.com <jneiberger_at_gmail.com>; ccielab_at_groupstudy.com <ccielab_at_groupstudy.com>
Subject: RE: EEM to keep BGP peer shut during an interface flap

Joe, there was a time when I would have happily joined your flame war on a professional forum, but I'm above that and someday I hope you are too. TDM is still alive and he didn't specify he was tracking someone else's interface. Your comments about the issues with intermediary devices in Ethernet access are valid, but aren't news, either. Likewise, "why only EEM" isn't the same as "what is EEM," but I'll extend the benefit of the doubt and assume you were having an off day.

Regarding the issue at hand, if the goal is to minimize flapping while keeping your capacity up to par, something simple would be using a cumulative penalty multiplier. I'm a fan of returning from failure scenarios in a controlled manner in maintenance windows, but I've worked for clients that consider simplex mode an outage or have exceeded their 50% limit for maintaining redundancy.

In the script below, it'll keep track of failures and increase the penalty. By playing with the values in 2.1 and 2.2, you can increase the down-time to something more reasonable than the aggressive values I've got below. Bear in mind that if the config is saved, the current instability value will be saved as well. This could be solved in multiple ways, like an initialization script, using contexts instead, etc. I couldn't see letting it get triggered more than once a minute or be down for more than a day, but those could obviously be modified as well.

event manager environment instability 0

event manager applet HealthMonitor
event syslog pattern "%BGP-5-NBR_RESET: Neighbor 10.10.23.3 reset" maxrun 87000 ratelimit 60
action 0.9 cli command "enable"
action 1.0 cli command "conf t"
action 1.1 cli command "router bgp 2"
action 1.2 cli command "neighbor 10.10.23.3 shutdown"
action 2.1 increment instability 1
action 2.2 multiply $instability 60
action 2.3 set punishment "$_result"
action 2.4 if $punishment gt "86400"
action 2.5 set punishment "86400"
action 2.6 end
action 2.7 syslog msg "Punishment is $punishment"
action 2.8 set remaining "$punishment"
action 3.1 while $remaining ge 1
action 3.3 wait 1
action 3.4 decrement remaining 1
action 3.5 end
action 4.0 syslog msg "Shutdown timer of $punishment elapsed. Re-enabling peer."
action 4.1 cli command "no neighbor 10.10.23.3 shutdown"
action 4.2 cli command "event manager environment instability $instability"
action 4.3 cli command "end"

Bear in mind, pinging things outside of your administrative control makes the behavior of your network beholden to an external entity's security policies, which likely don't take such things into account.

-Jon

-----Original Message-----
From: nobody_at_groupstudy.com [mailto:nobody_at_groupstudy.com] On Behalf Of Jay McMickle
Sent: Thursday, August 15, 2013 10:01 PM
To: Christopher Rae
Cc: Joseph L. Brunner; jon.hartman_at_verizon.net; mathewfer_at_gmail.com; marco207p_at_gmail.com; jneiberger_at_gmail.com; ccielab_at_groupstudy.com
Subject: Re: EEM to keep BGP peer shut during an interface flap

Joe is always trying to kick over rocks. Ignore him, it's a Jekyll and Hyde thing. ;) BWhahahahaahaa!
Oh, and I think he meant CCIE lab rat, not rate.

Yes, Jon knows BGP well, and made me feel very small in our CCIE training class in Sept 2011.

Regards,
Jay McMickle- 2x CCIE #35355 (R/S,Sec)
Sent from my iPhone 5

On Aug 15, 2013, at 10:15 AM, Christopher Rae <chris.rae07_at_me.com> wrote:

> What's a lab rate ccie?
>
> Cheers
> Chris Rae
>
> On 15/08/2013, at 11:08 PM, "Joseph L. Brunner"
> <joe_at_affirmedsystems.com> wrote:
>
>> Another lab rate ccie :)
>>
>> Cause Jon,
>>
>> ISPs are often useless post office style entities. We often can't rely
>> on them for much. In my experience (500+ bgp implementations with a
>> dual homed site or colo) the carriers can do things like freeze up,
>> so you have to wait the keepalive and dead times before the secondary
>> route(s) take over. BFD? I have not seen an isp offer that. We have
>> Windstream (Paetec), TWC, Level3, Transbeam and Cogent to choose from
>> here in NYC. I have a hard enough time just getting the peering
>> session setup (one of those carrier's noc guy needed a config, I kid
>> you not)
>>
>> EEM can also send you an email when bad things happen before your
>> users (or boss) come and tell you...
>>
>> Also, fast external failover is often useless. We are in the ethernet
>> society... That feature was designed 15 years ago in the era of hdlc
>> and ppp connections - like a DS3/T3. Your interface will almost never
>> go down when your ethernet isp is "down". I know on my Level3
>> connections there are 2 alcatel lucent boxes between us and the
>> juniper router actually doing the bgp. No chance that will help.
>>
>> EEM is your final control of how the router functions under different
>> bgp and other conditions. Don't leave home without it...
>>
>>
>> ----- Original Message -----
>> From: Jon Hartman [mailto:jon.hartman_at_verizon.net]
>> Sent: Thursday, August 15, 2013 10:40 AM
>> To: Christopher Rae <chris.rae07_at_me.com>
>> Cc: Mathew <mathewfer_at_gmail.com>; Joe Sanchez <marco207p_at_gmail.com>;
>> Joseph L. Brunner; John Neiberger <jneiberger_at_gmail.com>; Cisco
>> certification <ccielab_at_groupstudy.com>
>> Subject: Re: EEM to keep BGP peer shut during an interface flap
>>
>> I'd have to think that features like BFD, bgp fast failover,
>> interface dampening, and BGP dampening would accommodate the issue at hand.
>>
>> Why the requirement to use EEM?
>>
>> Jon Hartman
>> CCIE #34941
>>
>> On Aug 15, 2013, at 4:14 AM, "Christopher Rae" <chris.rae07_at_me.com> wrote:
>>
>>> Hey Joseph,
>>>
>>> Yes, had BFD running with a few providers no worries.
>>>
>>> Cheers
>>> Chris
>>>
>>> -----Original Message-----
>>> From: nobody_at_groupstudy.com [mailto:nobody_at_groupstudy.com] On Behalf
>>> Of Mathew
>>> Sent: Thursday, August 15, 2013 3:47 PM
>>> To: Joe Sanchez
>>> Cc: Joseph L. Brunner; John Neiberger; Chris Rae; Cisco
>>> certification
>>> Subject: Re: EEM to keep BGP peer shut during an interface flap
>>>
>>> Hi,
>>>
>>> I just tried the below but I could not get it to work. The idea is
>>> to ping an IP and depending on the result to take action.
>>>
>>> I think line "action 11.2 regexp "(.*) (!\!\!\!\!) (.*)"
>>> "$_cli_result" _match _sub1" is NOT correct.
>>> As I am still building this applet, I run this manually.
>>>
>>> How do I get this regular expression correctly to match ping result?
>>>
>>> R2#show event manager version | in Event Manager Version
>>> Embedded Event Manager Version 3.00
>>> R2#
>>>
>>> !
>>> event manager applet CHECK-PING-STATUS
>>> event none
>>> action 11.1 cli command "ping 2.2.2.2"
>>> action 11.2 regexp "(.*) (!\!\!\!\!) (.*)" "$_cli_result" _match _sub1
>>> action 11.3 if $_regexp_result eq 1
>>> action 11.4 syslog msg "Ping is success"
>>> action 11.5 else
>>> action 11.6 syslog msg "Ping is failed"
>>> action 11.7 end
>>> !
>>>
>>> Mathew
>>>
>>> On Wed, Aug 14, 2013 at 11:09 PM, Joe Sanchez <marco207p_at_gmail.com> wrote:
>>>> Level 3 will as long as you're homed to the right gateway boxes.
>>>>
>>>> Regards,
>>>> Joe Sanchez
>>>>
>>>> ( please excuse the brevity of this email as it was sent via a
>>>> mobile device. Please excuse misspelled words or sentence
>>>> structure.)
>>>>
>>>> On Aug 14, 2013, at 3:26 AM, "Joseph L. Brunner"
>>>> <joe_at_affirmedsystems.com> wrote:
>>>>
>>>>> I have never seen an ISP that will run BFD with any customers...
>>>>> they seem to have enough issues just getting basic bgp setup
>>>>> (cogent
>>>>> anyone?)
>>>>>
>>>>> How about an EEM solution that shuts down bgp for a few hours and
>>>>> turns it back on after market hours? Yes it works... we use it :)
>>>>>
>>>>> kbro-voip-rt01#show run | sec event
>>>>>
>>>>> event manager directory user policy "flash:/"
>>>>> event manager policy sendmail.tcl
>>>>>
>>>>> event manager applet ShutdownCohereBGPNeighbor
>>>>> event track 10 state down
>>>>> action 1.0 info type routername
>>>>> action 2.0 cli command "enable"
>>>>> action 2.1 cli command "configure terminal"
>>>>> action 2.5 cli command "router bgp 65080"
>>>>> action 2.6 cli command "neighbor 208.71.93.213 shutdown"
>>>>> action 3.0 mail server "outbounds9.obsmtp.com" to "kbro-notif_at_affirmedsystems.com" from "kbro-voip-rt01_at_kbro.com" subject "Cohere VoIP Direct route down @ $_info_routername"
>>>>>
>>>>> event manager applet EnableCohereat8PM
>>>>> event timer cron name EnableCohereat8PM cron-entry "0 20 * * *"
>>>>> action 1.0 info type routername
>>>>> action 2.0 cli command "enable"
>>>>> action 2.1 cli command "configure terminal"
>>>>> action 2.5 cli command "router bgp 65080"
>>>>> action 2.6 cli command "no neighbor 208.71.93.213 shutdown"
>>>>>
>>>>> event manager applet NoShutCohere805PM
>>>>> event tag 1.0 track 10 state up
>>>>> event tag 2.0 timer cron name NoShutCohere805PM cron-entry "5 20 * * *"
>>>>> trigger occurs 1 delay 10
>>>>> correlate event 1.0 and event 2.0
>>>>> attribute tag 1.0 occurs 1
>>>>> attribute tag 2.0 occurs 1
>>>>> action 1.0 info type routername
>>>>> action 2.0 cli command "enable"
>>>>> action 2.1 cli command "configure terminal"
>>>>> action 2.5 cli command "router bgp 65080"
>>>>> action 2.6 cli command "no neighbor 208.71.93.213 shutdown"
>>>>> action 2.7 cli command "do clear ip nat translation *"
>>>>> action 3.0 mail server "outbounds9.obsmtp.com" to "kbro-notif_at_affirmedsystems.com" from "kbro-voip-rt01_at_kbro.com" subject "Cohere VoIP Direct route restored @ $_info_routername"
>>>>>
>>>>>
>>>>> event manager applet EnableCohereat7AM
>>>>> event timer cron name EnableCohereat7AM cron-entry "0 7 * * *"
>>>>> action 1.0 info type routername
>>>>> action 2.0 cli command "enable"
>>>>> action 2.1 cli command "configure terminal"
>>>>> action 2.5 cli command "router bgp 65080"
>>>>> action 2.6 cli command "no neighbor 208.71.93.213 shutdown"
>>>>>
>>>>> event manager applet KeepNoShutCohere705AM
>>>>> event tag 1.0 track 10 state up
>>>>> event tag 2.0 timer cron name KeepNoShutCohere705AM cron-entry "5 7 * * *"
>>>>> trigger occurs 1 delay 10
>>>>> correlate event 1.0 and event 2.0
>>>>> attribute tag 1.0 occurs 1
>>>>> attribute tag 2.0 occurs 1
>>>>> action 1.0 info type routername
>>>>> action 2.0 cli command "enable"
>>>>> action 2.1 cli command "configure terminal"
>>>>> action 2.5 cli command "router bgp 65080"
>>>>> action 2.6 cli command "no neighbor 208.71.93.213 shutdown"
>>>>> action 2.7 cli command "do clear ip nat translation *"
>>>>> action 3.0 mail server "outbounds9.obsmtp.com" to "kbro-notif_at_affirmedsystems.com" from "kbro-voip-rt01_at_kbro.com" subject "Cohere VoIP Direct route restored @ $_info_routername"
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: nobody_at_groupstudy.com [mailto:nobody_at_groupstudy.com] On
>>>>> Behalf Of John Neiberger
>>>>> Sent: Tuesday, August 13, 2013 12:12 PM
>>>>> To: Chris Rae
>>>>> Cc: Mathew; Cisco certification
>>>>> Subject: Re: EEM to keep BGP peer shut during an interface flap
>>>>>
>>>>> This. Exactly. Use BFD for this. It already does what you're
>>>>> trying to do and it's a heck of a lot easier to configure.
>>>>>
>>>>>
>>>>> On Tue, Aug 13, 2013 at 6:53 AM, Chris Rae <chris.rae07_at_me.com> wrote:
>>>>>
>>>>>> Hey Matt,
>>>>>>
>>>>>> Why not just use BFD?
>>>>>> If the BFD peer is down (ie no keep alive or interface goes down)
>>>>>> BGP will immediately reroute via other peer.
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> On 13/08/2013, at 7:52 PM, Mathew <mathewfer_at_gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I tested two EEM applet configs:
>>>>>>>
>>>>>>> - One checks syslog for an interface down and uses the CLI to shut
>>>>>>> down the BGP peer.
>>>>>>> - The second one no-shuts the BGP peer when a syslog entry is seen
>>>>>>> with the interface up.
>>>>>>>
>>>>>>> In fact the interface that I want to check is NOT being used for
>>>>>>> this BGP peering so there is no way to do it with BGP configuration.
>>>>>>>
>>>>>>> The above two EEM configs work, but the issue is that when this
>>>>>>> interface starts to flap, EEM keeps shutting and no-shutting the BGP
>>>>>>> peer. I want to avoid this as it results in a BGP flap.
>>>>>>>
>>>>>>> Has anybody tried an EEM solution to keep the BGP peer shut
>>>>>>> during an interface flap?
>>>>>>>
>>>>>>> I do not mind keeping the BGP peer shut till the interface flapping is
>>>>>>> over, but how do we do/detect it with EEM?
>>>>>>>
>>>>>>> Thanks in advance for your replies.
>>>>>>>
>>>>>>> Mathew
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks
>>>>>>>
>>>>>>> Mathew
>>>>>>>
>>>>>>>
>>>>>>> Blogs and organic groups at http://www.ccie.net
>>>>>>>
>>>>>>> _______________________________________________________________________
>>>>>>> Subscription information may be found at:
>>>>>>> http://www.groupstudy.com/list/CCIELab.html
>>>
>>>
>>>
>>> --
>>> Thanks
>>>
>>> Mathew
>>>

Blogs and organic groups at http://www.ccie.net
Received on Thu Aug 29 2013 - 07:49:51 ART

This archive was generated by hypermail 2.2.0 : Sun Sep 01 2013 - 08:35:51 ART