From: Petr Lapukhov (petr@internetworkexpert.com)
Date: Sun Jul 13 2008 - 04:22:12 ART
John,
WFQ may look a little bit tricky, especially if you're digging into the
implementation details right away :) However, let's start from the very
basics, looking at the original roots of the WFQ/FQ family of schedulers.
First, it all started with an abstract algorithm called GPS (Generalized
Processor Sharing), which describes an ideal procedure for sharing one resource
between many entities in a *statistically fair* and efficient manner. Here
"fair" means so-called "max-min" fairness, which maximizes the minimal share
of any of the active entities; that is, it tries to give as much as possible to
the most starving entity. Additionally, you may assign "weights" to entities
to scale their importance. (Fairness, in general, is a very complicated topic,
rooted in game theory, which has raised a lot of heated discussions over the
years. "Max-min" fairness is just one proposed scheme.)
However, GPS is an abstract algorithm, which cannot be implemented in the
real world, because it treats the "resource shares" (packets, when applied to
networking) as infinitesimally divisible rather than as quanta. Therefore, an
"approximation" scheme is needed to make it work in the real world. A large
number of implementations simulating GPS behavior exist, including: Fair
Queuing (FQ), Weighted Fair Queuing (WFQ), Round Robin (RR), Weighted Round
Robin (WRR), Deficit Round Robin (DRR), Modified Deficit Round Robin (MDRR)
and so on. All of them represent more or less accurate "packet-based"
approximations of GPS.
First, look at Fair Queuing (FQ). The algorithm was proposed by the famous
John Nagle (back in the 80s), who is also responsible for that *service
nagle* command in IOS ;) The central concept of FQ is the packet flow
(the equivalent of a process in GPS), which could be defined in various ways.
For example, you may define a flow as a set of packets sharing the same
source/destination IP addresses. Cisco decided to define a flow based on
source/destination IPs, source/destination ports (where available), IP
protocol value and the least significant 5 bits of the TOS byte (ouch, this is
complicated). They use a special hash function that runs over all these values
and returns the queue number to put the packet in. (It may easily be shown
that basing flows on port numbers allows one user with a lot of flows to claim
more bandwidth than other users can get. This is very true with respect to
P2P applications, and that is why the folks at the IETF are trying to invent a
new fairness scheme that works well in the P2P sharing world.)
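Just to illustrate the classification step (the exact IOS hash is not
published, so the function below is hypothetical), the idea is that the flow
identifiers are hashed into one of a fixed, power-of-two number of queue
indexes:

import zlib

N_QUEUES = 256   # always a power of two: 256, 512, 1024 ...

def classify(src_ip, dst_ip, proto, src_port, dst_port, tos):
    # Hypothetical stand-in for the IOS flow hash: addresses, ports,
    # protocol and the low 5 bits of TOS (precedence bits excluded)
    # are reduced to a queue index.
    key = "%s,%s,%d,%d,%d,%d" % (src_ip, dst_ip, proto,
                                 src_port, dst_port, tos & 0x1F)
    return zlib.crc32(key.encode()) % N_QUEUES

# Packets of the same conversation always land in the same queue:
print(classify("10.0.0.1", "10.0.0.2", 6, 1024, 80, 0))
print(classify("10.0.0.1", "10.0.0.2", 6, 1024, 80, 0))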
The second concept of FQ is the per-flow queue - each flow has its own
FIFO queue in the memory space used by the FQ scheduler. Therefore, the number
of active queues for FQ is dynamic and changes with the number of active flows,
but all queues share the same buffer space assigned to the Fair Queue scheduler.
Note that the structure of the hash function used by Cisco dictates that the
number of queues is always a power of 2: 256, 512, 1024, 2048, 4096, etc.
To decide which of the FIFO queues to serve first, the FQ scheduler calculates
a so-called "virtual finishing time" for the packet sitting at the head of every
FIFO queue. This value is based on the packet size and the packet arrival
time. The packet with the minimum expected "virtual finishing time" is de-queued
first, and this is how a flow with small packets is able to get the same
share as a flow with large packets. The ultimate goal of FQ is to divide
interface bandwidth equally between flows, i.e. for N flows and interface
rate R, every flow will statistically obtain bandwidth of no less than R/N (of
course a flow may use less bandwidth, but R/N is guaranteed in case the flow
speeds up). Note that this formula is only valid for sustained flows
(statistically stable); you will get less accurate results with short,
dynamic flows.
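A very condensed model of that dequeuing logic might look like this in Python
(a simplification, assuming finish times are measured in bytes and glossing
over how real WFQ tracks its virtual clock):

import heapq

class FairQueueModel:
    # Simplified FQ model: each flow keeps its own FIFO, and the scheduler
    # always transmits the head packet with the smallest finish time.
    def __init__(self):
        self.virtual_time = 0      # advances as packets are sent
        self.flow_finish = {}      # last finish time assigned per flow
        self.heap = []             # (finish_time, seq, flow_id, size)
        self.seq = 0

    def enqueue(self, flow_id, size):
        # Finish time = max(current virtual time, flow's previous finish
        # time) + packet size; small packets "finish" sooner.
        start = max(self.virtual_time, self.flow_finish.get(flow_id, 0))
        finish = start + size
        self.flow_finish[flow_id] = finish
        heapq.heappush(self.heap, (finish, self.seq, flow_id, size))
        self.seq += 1

    def dequeue(self):
        finish, _, flow_id, size = heapq.heappop(self.heap)
        self.virtual_time = finish
        return flow_id, size

fq = FairQueueModel()
for _ in range(3):
    fq.enqueue("small-packet flow", 100)    # e.g. interactive traffic
    fq.enqueue("large-packet flow", 1500)   # e.g. a bulk transfer
while fq.heap:
    print(fq.dequeue())

Running it, all three 100-byte packets are transmitted before the first
1500-byte packet, even though the two flows enqueued alternately - the
small-packet flow is not starved behind the bulk flow, just as GPS would
have it.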
Next, WFQ (Weighted Fair Queuing) is a more general technique, which assigns
weights to packet flows and shares the bandwidth in proportion to those weights.
Whereas FQ provides an R/N share of rate R to each of N flows, WFQ will provide
a wX*R/(w1+w2+...+wN) [formula 1] share of the bandwidth to flow X. Cisco's
implementation assigns weights automatically, based on a flow's IP
precedence value (this is why the top 3 bits of the TOS byte are not taken into
account by the WFQ hashing function). You may consider w1, w2, ... wN to equal
"IP Precedence + 1" for use with the above-mentioned "formula 1" in actual
computation of bandwidth shares. Obviously, the formula means that if you
have just one flow of each precedence value, the bandwidth is shared in
proportions of "IP Precedence + 1" (e.g. for one IP precedence 0 flow and one
IP precedence 1 flow you will get a 1:2 share). For a more complicated
example, consider a 256Kbps interface with five precedence 1 flows, three
precedence 2 flows and one precedence 5 flow on the same link. The
bandwidth shares will be as follows.
Total number of flows (N) equals: 5+3+1=9.
Weights ("IP Precedence + 1"):
w1=2, w2=2, w3=2, w4=2, w5=2 (the precedence 1 flows)
w6=3, w7=3, w8=3 (the precedence 2 flows)
w9=6 (the precedence 5 flow)
Shares per formula 1:
for each precedence 1 flow: 2/(5*2+3*3+1*6)=2/25
for each precedence 2 flow: 3/(5*2+3*3+1*6)=3/25
for the precedence 5 flow: 6/(5*2+3*3+1*6)=6/25
And if the interface bandwidth is 256K, the resulting values are roughly
20.5Kbps for each precedence 1 flow, 30.7Kbps for each precedence 2 flow and
61.4Kbps for the precedence 5 flow.
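You can verify these numbers with a couple of lines of Python, applying
formula 1 directly:

# Bandwidth shares per "formula 1": share_X = w_X * R / sum(all weights)
R = 256.0                              # interface rate, Kbps
flows = [1]*5 + [2]*3 + [5]            # IP precedence of each flow
weights = [p + 1 for p in flows]       # WFQ weight = IP precedence + 1
total = sum(weights)                   # 5*2 + 3*3 + 1*6 = 25
for prec in sorted(set(flows)):
    print("precedence %d flow: %.1f Kbps" % (prec, (prec + 1) * R / total))
# precedence 1 flow: 20.5 Kbps
# precedence 2 flow: 30.7 Kbps
# precedence 5 flow: 61.4 Kbps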
However, in the actual implementation Cisco uses a special "scale factor" to
multiply the "virtual finishing time" for each flow and achieve that
weighted fairness. Prior to IOS 12.0(5)T, the formula for the "scale factor"
was 4096/(IP Precedence + 1), and modern IOSes use 32768/(IP Precedence
+ 1). As you can see, this scale factor is larger for smaller IP
precedences, so a flow with a higher IP precedence will have its "virtual
finishing time" scaled less than a flow with a smaller IP precedence value.
However, those are only implementation details; what's important to
note is that to get the bandwidth shares you may always use "formula 1".
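If you are curious how the scale factors line up with the weights (again,
only a model of the idea, not the exact IOS code), note that
32768/(IP Precedence + 1) shrinks as precedence grows:

# Scale factors in modern IOS WFQ: 32768 / (IP Precedence + 1)
for prec in range(8):
    print("precedence %d -> scale factor %d" % (prec, 32768 // (prec + 1)))
# A new packet's sequence number is then roughly the previous sequence
# number of its flow plus scale_factor * packet_length, so for equal
# packet sizes a precedence 5 flow advances its "clock" six times more
# slowly than a precedence 0 flow - matching the 1:6 ratio of formula 1.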
A few words on RSVP interaction with WFQ. WFQ was designed to be RSVP-aware,
and every RSVP reservation gets a separate flow (FIFO queue) served by the WFQ
scheduler. The "scaling factors" for RSVP flows are very small compared to
the values calculated for the regular flows. For example, the RSVP
reservation with the largest bandwidth demand will get a scaling factor of 6,
and every other reservation gets a higher scaling factor based on its
bandwidth demand, but still a very small one. This is how RSVP flows always
get the bandwidth they are asking for (provided that they are admitted by the
RSVP process).
The next thing, which is specific to Cisco's implementation of WFQ, is the
so-called Congestive Discard Threshold (CDT). As we remember, WFQ shares a
single buffer between dynamic flows. Each queue is FIFO and could eventually
eat all the available buffer space. To avoid this condition, CDT is used in
the following manner. When one of the queues reaches CDT size, a packet is
dropped, but possibly not from the oversized queue. The actual queue that has
a packet dropped is the most "active" queue of them all - the one that has
accumulated the most bytes in packets and has the biggest total virtual
finishing time. This way (hoping that the flow was TCP) the most active flow
is implicitly signaled to slow down its sending rate. This is not a part of
the original FQ or WFQ algorithm, but Cisco's extension to make it more
adaptive. Here is how WFQ settings may look on an IOS router:
"fair-queue <CDT> <NumDynamicQueues> <RSVP Queues>". The total buffer space
available to WFQ is set with the "hold-queue" command.
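As a rough sketch of that drop logic (a simplified model, assuming we only
track queue lengths and buffered bytes rather than the real sequence
numbers):

# Simplified model of the congestive-discard idea: when a flow queue grows
# past CDT packets, drop from the tail of the most "active" flow - the one
# holding the most buffered bytes - which is not necessarily the flow whose
# packet just arrived.
CDT = 64

def enqueue_with_cdt(queues, flow_id, size):
    # queues: dict mapping flow_id -> list of packet sizes (a FIFO)
    queues.setdefault(flow_id, []).append(size)
    if len(queues[flow_id]) > CDT:
        worst = max(queues, key=lambda f: sum(queues[f]))
        queues[worst].pop()    # drop from the tail of the busiest flow
        return worst           # this flow is implicitly told to slow down
    return None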
Another two extensions to WFQ were IP RTP Reserve and IP RTP Priority (the
latter is also known as PQ/WFQ). These two specifically aim to make WFQ work
with voice traffic, which requires expedited priority treatment for voice
bearer packets. IP RTP Reserve used a special flow with a very small scaling
factor (similar to RSVP queues and also based on the bandwidth provided) to
queue all RTP packets (actually packets with UDP ports in the range you
specify), but it was deprecated in favor of IP RTP Priority. The latter
provides a true priority queue (a separate FIFO flow queue) for RTP traffic
on an interface with WFQ enabled. RTP packets are always serviced first, but
limited in their rate using an explicit policing rate (note that the burst
size for this special policer equals the specified rate multiplied by one
second). The commands for IP RTP Reserve and IP RTP Priority are "ip rtp
reserve 16384 16383 256" and "ip rtp priority 16384 16383 256".
The last thing about WFQ is that it reserves a special number of flow queues
for network management traffic, such as link-layer keepalives, routing
protocol traffic, netflow packets, etc. You can't see those special flow
queues in the show command output, but they do exist and have special high
weights to guarantee timely delivery of critical network traffic.
Of course, all WFQ features were superseded by CBWFQ, which not only allows
you to specify any criteria for traffic flows and assign weights manually,
but also integrates tightly with other QoS features using the common MQC
syntax. At this time, I'm working on a more detailed post for our blog,
describing WFQ mechanics in depth. So if you are looking for more
information on WFQ and CBWFQ, stay tuned :)
--Petr Lapukhov, CCIE #16379 (R&S/Security/SP/Voice) petr@internetworkexpert.com
Internetwork Expert, Inc. http://www.InternetworkExpert.com
2008/7/13 John <jgarrison1@austin.rr.com>:
> I don't know why I have a hard time getting this, but lets see if I have
> this right.
>
> WFQ will give low volume traffic "better treatment" then high volume
> traffic. It does this through some mysterious algorithm that I can't seem
> to find. WFQ is the default for all interfaces under E1 speeds. WFQ works
> with ip precedence and RSVP. WFQ is flow based. I cannot change the
> behaviour of WFQ through configuration. There are some QOS features I
> cannot configure on an interface with WFQ.
>
> Thats what I get. I would appreciate comments on anything I missed or got
> wrong