From patchwork Tue Jan 12 00:39:55 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mike Holmes X-Patchwork-Id: 59587 Delivered-To: patch@linaro.org Received: by 10.112.130.2 with SMTP id oa2csp2435420lbb; Mon, 11 Jan 2016 16:40:41 -0800 (PST) X-Received: by 10.140.173.132 with SMTP id t126mr176566804qht.96.1452559241019; Mon, 11 Jan 2016 16:40:41 -0800 (PST) Return-Path: Received: from lists.linaro.org (lists.linaro.org. [54.225.227.206]) by mx.google.com with ESMTP id d138si123352343qhd.108.2016.01.11.16.40.40; Mon, 11 Jan 2016 16:40:40 -0800 (PST) Received-SPF: pass (google.com: domain of lng-odp-bounces@lists.linaro.org designates 54.225.227.206 as permitted sender) client-ip=54.225.227.206; Authentication-Results: mx.google.com; spf=pass (google.com: domain of lng-odp-bounces@lists.linaro.org designates 54.225.227.206 as permitted sender) smtp.mailfrom=lng-odp-bounces@lists.linaro.org; dkim=neutral (body hash did not verify) header.i=@linaro.org Received: by lists.linaro.org (Postfix, from userid 109) id 5B35C61717; Tue, 12 Jan 2016 00:40:40 +0000 (UTC) Authentication-Results: lists.linaro.org; dkim=fail reason="verification failed; unprotected key" header.d=linaro.org header.i=@linaro.org header.b=IBatckP6; dkim-adsp=none (unprotected policy); dkim-atps=neutral X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on ip-10-142-244-252 X-Spam-Level: X-Spam-Status: No, score=-2.5 required=5.0 tests=BAYES_00,DKIM_SIGNED, RCVD_IN_DNSWL_LOW,RCVD_IN_MSPIKE_H2,T_DKIM_INVALID,URIBL_BLOCKED autolearn=disabled version=3.4.0 Received: from [127.0.0.1] (localhost [127.0.0.1]) by lists.linaro.org (Postfix) with ESMTP id E0F2D61703; Tue, 12 Jan 2016 00:40:24 +0000 (UTC) X-Original-To: lng-odp@lists.linaro.org Delivered-To: lng-odp@lists.linaro.org Received: by lists.linaro.org (Postfix, from userid 109) id 606E861703; Tue, 12 Jan 2016 00:40:18 +0000 (UTC) Received: from mail-qg0-f47.google.com (mail-qg0-f47.google.com [209.85.192.47]) by lists.linaro.org (Postfix) with ESMTPS id 41C8A6170E for ; Tue, 12 Jan 2016 00:40:09 +0000 (UTC) Received: by mail-qg0-f47.google.com with SMTP id e32so328171748qgf.3 for ; Mon, 11 Jan 2016 16:40:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=cb8joMvMMYgSdW0Ox1WNuaCNp7JWdTgXL3dJlWrv9iU=; b=IBatckP6xpt4yk8GALCm9cctukYw2PYNL3x14jOg+le9QCaVfV/e3mfo6AIh3KXeue R2Fgbg4n24dKaAKEWPI/hQgA9xWxqb5q/XiihmvJtZhd3K4NbXV0y9iqJ20DOY8yqVgD ZlLQoAjr6KlPS7DLiSgfTPhE6Se++WkTzkMBI= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=cb8joMvMMYgSdW0Ox1WNuaCNp7JWdTgXL3dJlWrv9iU=; b=BrUoeHjnRpRQ6KgLXLxKz70a9v66JvyrvrM4h7JO+KsrvxfjwTI6YK+ruyXDszchb5 4jZa9sCBO3aeM4uH2CakrY9hBoqyD2AGwGg/0spOIsmnzaay3P0Aj+CFp3rjkxh1SWFo 3aPlLW7ak+yS1x7NM1wQNLSxL02iD5xthRV9MB14y+BaxQGzIu8oPz+pGB8YFZmzqoDG FCNwkuYvbxx2pNKDlegoAwRM2+lLRJCcvKsALCPbpCK8niH5YXBetsgJDaUI2NQC2GQk ndxPzXWUFigR0KmRzdNGco8/P1+uAdgatERNrdlNckFETpgEzvl2kiYHH2cEOhP5VlH+ ShaQ== X-Gm-Message-State: ALoCoQnByBGW2Cpnewm7CFeYvypYFhOFv6zjuxbhdOF0v4/VbN0RFnAGpv5QpMvsOzQ3r6D3fKtLmXHtA1f1i+Oph4Ej/idyQQ== X-Received: by 10.140.145.196 with SMTP id 187mr153016941qhr.15.1452559208887; Mon, 11 Jan 2016 16:40:08 -0800 (PST) Received: from localhost.localdomain (c-98-221-136-245.hsd1.nj.comcast.net. [98.221.136.245]) by smtp.gmail.com with ESMTPSA id f34sm3224987qgf.42.2016.01.11.16.40.08 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Mon, 11 Jan 2016 16:40:08 -0800 (PST) From: Mike Holmes To: lng-odp@lists.linaro.org Date: Mon, 11 Jan 2016 19:39:55 -0500 Message-Id: <1452559195-884-2-git-send-email-mike.holmes@linaro.org> X-Mailer: git-send-email 2.5.0 In-Reply-To: <1452559195-884-1-git-send-email-mike.holmes@linaro.org> References: <1452559195-884-1-git-send-email-mike.holmes@linaro.org> X-Topics: patch Subject: [lng-odp] [API-NEXT PATCH v2 2/2] doc: users: migrate TM from API to users doc X-BeenThere: lng-odp@lists.linaro.org X-Mailman-Version: 2.1.16 Precedence: list List-Id: "The OpenDataPlane \(ODP\) List" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Errors-To: lng-odp-bounces@lists.linaro.org Sender: "lng-odp" Signed-off-by: Mike Holmes --- v2: Directly use svg in the documentation and dont generate a png since modern browsers can work with .svg files doc/users-guide/Makefile.am | 2 +- doc/users-guide/users-guide-tm.adoc | 264 ++++++++++++++++++++++++++++++++++++ doc/users-guide/users-guide.adoc | 2 + include/odp/api/traffic_mngr.h | 237 -------------------------------- 4 files changed, 267 insertions(+), 238 deletions(-) create mode 100644 doc/users-guide/users-guide-tm.adoc diff --git a/doc/users-guide/Makefile.am b/doc/users-guide/Makefile.am index 383894a..d3a84c6 100644 --- a/doc/users-guide/Makefile.am +++ b/doc/users-guide/Makefile.am @@ -4,7 +4,7 @@ EXTRA_DIST = users-guide.adoc all-local: $(TARGET) -$(TARGET): users-guide.adoc +$(TARGET): users-guide.adoc users-guide-tm.adoc @mkdir -p $(top_srcdir)/doc/output asciidoc -a data-uri -b html5 -a icons -a toc2 -a max-width=55em --out-file=$@ $< diff --git a/doc/users-guide/users-guide-tm.adoc b/doc/users-guide/users-guide-tm.adoc new file mode 100644 index 0000000..dc0d003 --- /dev/null +++ b/doc/users-guide/users-guide-tm.adoc @@ -0,0 +1,264 @@ +== Traffic Manager \(TM) + +The TM subsystem is a general packet scheduling system that accepts +packets from input queues and applies strict priority scheduling, weighted fair +queueing scheduling and/or bandwidth controls to decide which input packet +should be chosen as the next output packet and when this output packet can be +sent onwards. + +A given platform supporting this TM API could support one or more pure hardware +based packet scheduling systems, one or more pure software based systems or one +or more hybrid systems - where because of hardware constraints some of the +packet scheduling is done in hardware and some is done in software. In +addition, there may also be additional API's beyond those described here for: + +- controlling advanced capabilities supported by specific hardware, software +or hybrid subsystems +- dealing with constraints and limitations of +specific implementations. + +The intention here is to be the simplest API that covers the vast majority of +packet scheduling requirements. + +Often a TM subsystem's output(s) will be directly connected to a device's +physical (or virtual) output interfaces/links, in which case sometimes such a +system will be called an Egress Packet Scheduler or an Output Link Shaper, +etc.. While the TM subsystems configured by this API can be used in such a +way, this API equally well supports the ability to have the TM subsystem's +outputs connect to other TM subsystem input queues or general software queues +or even some combination of these three cases. + +=== TM Algorithms + +The packet scheduling/dropping techniques that can be applied to input +traffic include any mixture of the following: + +- Strict Priority scheduling. +- Weighted Fair Queueing scheduling (WFQ). +- Bandwidth Shaping. +- Weighted Random Early Discard (WRED). + +Note that Bandwidth Shaping is the only feature that can cause packets to be +"delayed", and Weighted Random Early Discard is the only feature (other than +input queues becoming full) that can cause packets to be dropped. + +==== Strict Priority Scheduling + +Strict Priority Scheduling (or just priority for short), is a technique where input +queues and the packets from them, are assigned a priority value in the range 0 +.. ODP_TM_MAX_PRIORITIES - 1. At all times packets the the smallest priority +value will be chosen ahead of packets with a numerically larger priority value. +This is called strict priority scheduling because the algorithm strictly +enforces the scheduling of higher priority packets over lower priority +packets. + +==== Bandwidth Shaping + +Bandwidth Shaping (or often just Shaping) is the term used here for the idea of +controlling packet rates using single rate and/or dual rate token bucket +algorithms. For single rate shaping a rate (the commit rate) and a "burst +size" (the maximum commit count) are configured. Then an internal signed +integer counter called the _commitCnt_ is maintained such that if the _commitCnt_ +is positive then packets are eligible to be sent. When such a packet is +actually sent then its _commitCnt_ is decremented (usually by its length, but one +could decrement by 1 for each packet instead). The _commitCnt_ is then +incremented periodically based upon the configured rate, so that this technique +causes the traffic to be limited to the commit rate over the long term, while +allowing some ability to exceed this rate for a very short time (based on the +burst size) in order to catch up if the traffic input temporarily drops below +the commit rate. + +Dual Rate Shaping is designed to allow certain traffic flows to fairly send +more than their assigned commit rate when the scheduler has excess capacity. +The idea being that it may be better to allow some types of traffic to send +more than their committed bandwidth rather than letting the TM outputs be idle. +The configuration of Dual Rate Shaping requires additionally a peak rate and a +peak burst size. The peak rate must be greater than the related comls mit +rate, but the burst sizes have no similar constraint. Also for every input +priority that has Dual Rate shaping enabled, there needs to be an additional +equal or lower priority (equal or higher numeric priority value) assigned. +Then if the traffic exceeds its commit rate but not its peak rate, the +"excess" traffic will be sent at the lower priority level - which by the +strict priority algorithm should cause no degradation of the higher priority +traffic, while allowing for less idle outputs. + +==== Weighted Fair Queuing + +Weighted Fair Queuing (WFQ) is used to arbitrate amongst multiple input +packets with the same priority. Each input can be assigned a weight in the +range MIN_WFQ_WEIGHT..MAX_WFQ_WEIGHT (nominally 1..255) that affects the way +the algorithm chooses the next packet. If all of the weights are equal AND all +of the input packets are the same length then the algorithm is equivalent to a +round robin scheduling. If all of the weights are equal but the packets have +different lengths then the WFQ algorithm will attempt to choose the packet such +that inputs each get a fair share of the bandwidth - in other words it +implements a weighted round robin algorithm where the weighting is based on +frame length. + +When the input weights are not all equal and the input packet lengths vary then +the WFQ algorithm will schedule packets such that the packet with the lowest +"Virtual Finish Time" is chosen first. An input packet's Virtual Finish Time +is roughly calculated based on the WFQ object's base Virtual Finish Time when +the packet becomes the first packet in its queue plus its frame length divided +by its weight. +---- +virtualFinishTime = wfqVirtualTimeBase + (pktLength / wfqWeight) +---- + +In a system running at full capacity with no bandwidth limits - over the long +term - each input fan-in's average transmit rate will be the same fraction of +the output bandwidth as the fraction of its weight divided by the sum of all of +the WFQ fan-in weights. Hence larger WFQ weights result in better "service" +for a given fan-in. + +[source,c] +---- +totalWfqWeight = 0; +for (each fan-in entity - fanIn - feeding this WFQ scheduler) + totalWfqWeight += fanIn->sfqWeight; + +fanIn->avgTransmitRate = avgOutputRatefanIn->sfqWeight / totalWfqWeight; +---- + +==== Weighted Random Early Discard + +The Weighted Random Early Discard (WRED) algorithm deals with the situation +where an input packet rate exceeds some output rate (including the case where +Bandwidth Shaping limits some output rates). Without WRED enabled and +configured, the TM system will just implement a tail dropping scheme whereby +whichever packet is unlucky enough to arrive when an TM input queue is full +will be discarded regardless of priority or any other consideration. WRED +allows one to configure the system to use a better/fairer algorithm than simple +tail dropping. It works by measuring the "fullness" of various packet queues +and converting this percentage into a probability of random packet dropping +with the help of some configurable parameters. Then a random number is picked +and together with the drop probability, a decision is made to accept the packet +or drop it. A basic parameterization of WRED requires three parameters: + +- the maximum queue level (which could be either a maximum number of + packets or a maximum amount of memory (i.e. bytes/buffers) used), +- a starting threshold - which is a number in the range 0..100 + representing a percentage of the maximum queue level at which the + drop probability becomes non-zero, +- a drop probability - which is a number in the range 0..100 + representing a probability (0 means no drop and 100 means + certain drop) - which is used when the queue is near 100% full. + +Note that all packet drops for a TM system only occur when a new packet arrives +at a given TM system input queue. At that time either the WRED algorithm, if +enabled for this input queue, or the "input queue full" tail drop algorithm +will make a drop/no drop decision. After this point, any packets not dropped, +will at some point be sent out a TM output - assuming that the topology is +fully connected and enabled. + +=== Hierarchical Scheduling and tm_nodes + +This API supports the ability to do Hierarchical Scheduling whereby the +final scheduling decision is controlled by equal priority schedulers, +strict priority multiplexers, bandwidth shapers - at multiple levels - all +forming a tree rooted at a single egress object. In other words, all +tm_queues and tm_nodes have the property that their logical "output" feeds +into one fan-in of a subsequent tm_node or egresss object - forming a proper +tree. + +.Hierarchical Scheduling +image::../images/tm_hierarchy.svg[align="center"] + +Multi-level/hierarchical scheduling adds both great control and significant +complexity. Logically, despite the implication of the tm_node tree diagrams, +there are no queues between the levels of hierarchy. Instead all packets are +held in their input queue, until such time that the totality of all of the +tm_nodes in the single path from input queue to output object agrees that this +packet should be the next to be chosen to leave the TM system through the +output object "portal". Hence what flows from level to level is the "local +choice" of what packet/tm_queue should next be serviced. + +==== tm_nodes + +Tm_nodes are the main "entity"/object that a TM system is composed of. Each +tm_node is a mini-TM subsystem of its own, but the interconnection and +interplay of a multi-level "tree" of tm_nodes can allow the user to specify +some very sophisticated behaviours. Each tm_node can contain a set of scheduler +(one per strict priority level), a strict priority multiplexer, a bandwidth +shaper and a WRED component - or a subset of these. + +.Traffic Manager Node +image::../images/tm_node.svg[align="center"] + +In its full generality an tm_node consists of a set of "fan-in" connections to +preceding tm_queues or tm_nodes. The fan-in for a single tm_node can range +from 1 to many many thousands. This fan-in is divided first into a WFQ +scheduler per priority level. So if 4 priority levels are implemented by this +tm_node, there would be 4 WFQ schedulers - each with its own unique fan-in. +After the WFQ schedulers a priority chooser comes next - where it will always +choose the highest priority WFQ output available. The output of the priority +chooser then feeds a bandwidth shaper function which then finally uses the +shaper's propagation table to determine its output packet and its priority. +This output could then be remapped via a priority map profile and then becomes +one of the input fan-in to perhaps another level of tm_nodes, and so on. + +During this process it is important to remember that the bandwidth shaping +function never causes packets to be dropped. Instead all packet drops occur +because of tm_queue fullness or be running the WRED algorithm at the time a new +packet attempts to be appended to the end of some input queue. + +The WRED profile associated with an tm_node considers the entire set of +tm_queues feeding directly or indirectly into it as its measure of queue +fullness. + +==== tm_queues + +tm_queues are the second major type of "entity"/object that a TM system is +composed of. All packets MUST first enter the TM system via some tm_queue. +Then logically, the head packets of all of the tm_queues are examined +simultaneously by the entire TM system, and ONE tm_queue is chosen send its +head packet out of the TM system's egress. Abstractly packets stay in the +tm_queue until they are chosen at which time they are instantly transferred +from tm_queue to/through the corresponding TM egress. It is also important to +note that packets in the same tm_queue MUST always stay in order. In other +words, the second packet in an tm_queue must never leave the TM system through +a TM egress spigot before the first packet has left the system. So tm_queue +packet order must always be maintained. + +==== TM egress + +Note that TM egress objects are NOT referred to as queues, because in many/most +cases they don't have multi-packet structure but instead are viewed as a +port/spigot through which the TM system schedules and finally transfers input +packets through. + +=== Ideal versus Actual Behavior + +It is important to recognize the difference between the "abstract" mathematical +model of the prescribed behavior and real implementations. The model describes +the Ideal, but theoretically desired behavior, but such an Ideal is generally +not practical to implement. Instead, one understands that virtually all Real +TM systems attempt to approximate the Ideal behavior as given by the TM +configuration as best as they can - while still attaining high packet +processing performance. The idea is that instead of trying too hard to be +"perfect" at the granularity of say microseconds, it may be better to instead +try to match the long term Ideal behavior over a much more reasonable period of +time like a millisecond. It is generally better to have a stable +implementation that when averaged over a period of several milliseconds matches +the Ideal behavior very closely than to have an implementation that is perhaps +more accurate over a period of microseconds, but whose millisecond averaged +behavior drifts away from the Ideal case. + +=== Other TM Concepts + +==== Profiles + +This specification often packages related TM system parameters into +records/objects called profiles. These profiles can then be associated with +various entities like tm_nodes and tm_queue's. This way the amount of storage +associated with setting related parameters can be reduced and in addition it is +common to re-use the same set of parameter set over and over again, and also to +be able to change the parameter set once and have it affect lots of entities +with which it is associated with/applied to. + +==== Absolute Limits versus +odp_tm_capability_t+ + +This header file defines some constants representing the absolute maximum +settings for any TM system, though in most cases a TM system can (and should) +be created/instantiated with smaller values, since lower values will often +result in faster operation and/or less memory used. diff --git a/doc/users-guide/users-guide.adoc b/doc/users-guide/users-guide.adoc index 8723437..67c265f 100644 --- a/doc/users-guide/users-guide.adoc +++ b/doc/users-guide/users-guide.adoc @@ -789,6 +789,8 @@ in-place, when the output packet is the same as the input packet or the output packet can be a new packet provided by the application or allocated by the implementation from the session output pool. +include::users-guide-tm.adoc[] + == Glossary [glossary] worker thread:: diff --git a/include/odp/api/traffic_mngr.h b/include/odp/api/traffic_mngr.h index 2459a8b..09b13f2 100644 --- a/include/odp/api/traffic_mngr.h +++ b/include/odp/api/traffic_mngr.h @@ -39,243 +39,6 @@ extern "C" { * API's beyond those described here for (a) controlling advanced capabilities * supported by specific hardware, software or hybrid subsystems or (b) * dealing with constraints and limitations of specific implementations. - * The intention here is to be the simplest API that covers the vast majority - * of packet scheduling requirements. - * - * Often a TM subsystem's output(s) will be directly connected - * to a device's physical (or virtual) output interfaces/links, in which case - * sometimes such a system will be called an Egress Packet Scheduler or an - * Output Link Shaper, etc.. While the TM subsystems configured by this API - * can be used in such a way, this API equally well supports the ability to - * have the TM subsystem's outputs connect to other TM subsystem input queues - * or general software queues or even some combination of these three cases. - * - *

TM Algorithms

- * - * The packet scheduling/dropping techniques that can be applied to input - * traffic include any mixture of the following: - *
    - *
  1. Strict Priority scheduling. - *
  2. Weighted Fair Queueing scheduling (WFQ). - *
  3. Bandwidth Shaping. - *
  4. Weighted Random Early Discard (WRED). - *
- * Note that Bandwidth Shaping is the only feature that can cause packets - * to be "delayed", and Weighted Random Early Discard is the only feature - * (other than input queues becoming full) that can cause packets to be - * dropped. - * - *

Strict Priority Scheduling

- * Strict Priority Scheduling (or just priority for short), is a technique - * where input queues and the packets from them, are assigned a priority - * value in the range 0 .. ODP_TM_MAX_PRIORITIES - 1. At all times packets - * the the smallest priority value will be chosen ahead of packets with a - * numerically larger priority value. This is called strict priority - * scheduling because the algorithm strictly enforces the scheduling of - * higher priority packets over lower priority packets. - * - *

Bandwidth Shaping

- * Bandwidth Shaping (or often just Shaping) is the term used here for the - * idea of controlling packet rates using single rate and/or dual rate token - * bucket algorithms. For single rate shaping a rate (the commit rate) and - * a "burst size" (the maximum commit count) are configured. Then an - * internal signed integer counter called the commitCnt is maintained such - * that if the commitCnt is positive then packets are eligible to be sent. - * When such a packet is actually sent then its commitCnt is decremented - * (usually by its length, but one could decrement by 1 for each packet - * instead). The commitCnt is then incremented periodically based upon the - * configured rate, so that this technique causes the traffic to be limited - * to the commit rate over the long term, while allowing some ability to - * exceed this rate for a very short time (based on the burst size) in order - * to catch up if the traffic input temporarily drops below the commit rate. - * - * Dual Rate Shaping is designed to allow certain traffic flows to fairly - * send more than their assigned commit rate when the scheduler has excess - * capacity. The idea being that it may be better to allow some types of - * traffic to send more than their committed bandwidth rather than letting - * the TM outputs be idle. The configuration of Dual Rate Shaping requires - * additionally a peak rate and a peak burst size. The peak rate must be - * greater than the related comls mit rate, but the burst sizes have no similar - * constraint. Also for every input priority that has Dual Rate shaping - * enabled, there needs to be an additional equal or lower priority (equal or - * higher numeric priority value) assigned. Then if the traffic exceeds its - * commit rate but not its peak rate, the "excess" traffic will be sent at the - * lower priority level - which by the strict priority algorithm should - * cause no degradation of the higher priority traffic, while allowing for - * less idle outputs. - * - *

Weighted Fair Queuing

- * Weighted Fair Queuing (WFQ) is used to arbitrate amongst multiple input - * packets with the same priority. Each input can be assigned a weight in the - * range MIN_WFQ_WEIGHT..MAX_WFQ_WEIGHT (nominally 1..255) that affects the way - * the algorithm chooses the next packet. If all of the weights are equal AND - * all of the input packets are the same length then the algorithm is - * equivalent to a round robin scheduling. If all of the weights are equal - * but the packets have different lengths then the WFQ algorithm will attempt - * to choose the packet such that inputs each get a fair share of the - * bandwidth - in other words it implements a weighted round robin algorithm - * where the weighting is based on frame length. - * - * When the input weights are not all equal and the input packet lengths vary - * then the WFQ algorithm will schedule packets such that the packet with - * the lowest "Virtual Finish Time" is chosen first. An input packet's - * Virtual Finish Time is roughly calculated based on the WFQ object's base - * Virtual Finish Time when the packet becomes the first packet in its queue - * plus its frame length divided by its weight. - * @code - * virtualFinishTime = wfqVirtualTimeBase + (pktLength / wfqWeight) - * @endcode - * In a system running at full capacity with no bandwidth limits - over the - * long term - each input fan-in's average transmit rate will be the same - * fraction of the output bandwidth as the fraction of its weight divided by - * the sum of all of the WFQ fan-in weights. Hence larger WFQ weights result - * in better "service" for a given fan-in. - * @code - * totalWfqWeight = 0; - * for (each fan-in entity - fanIn - feeding this WFQ scheduler) - * totalWfqWeight += fanIn->sfqWeight; - * - * fanIn->avgTransmitRate = avgOutputRate * fanIn->sfqWeight / totalWfqWeight; - * @endcode - * - *

Weighted Random Early Discard

- * The Weighted Random Early Discard (WRED) algorithm deals with the situation - * where an input packet rate exceeds some output rate (including - * the case where Bandwidth Shaping limits some output rates). Without WRED - * enabled and configured, the TM system will just implement a tail dropping - * scheme whereby whichever packet is unlucky enough to arrive when an TM - * input queue is full will be discarded regardless of priority or any other - * consideration. - * WRED allows one to configure the system to use a better/fairer algorithm - * than simple tail dropping. It works by measuring the "fullness" of - * various packet queues and converting this percentage into a probability - * of random packet dropping with the help of some configurable parameters. - * Then a random number is picked and together with the drop probability, - * a decision is made to accept the packet or drop it. - * A basic parameterization of WRED requires three parameters: - *
    - *
  1. the maximum queue level (which could be either a maximum number of - * packets or a maximum amount of memory (i.e. bytes/buffers) used), - *
  2. a starting threshold - which is a number in the range 0..100 - * representing a percentage of the maximum queue level at which the - * drop probability becomes non-zero, - *
  3. a drop probability - which is a number in the range 0..100 - * representing a probability (0 means no drop and 100 means - * certain drop) - which is used when the queue is near 100% full. - *
- * - * Note that all packet drops for a TM system only occur when a new packet - * arrives at a given TM system input queue. At that time either the WRED - * algorithm, if enabled for this input queue, or the "input queue full" - * tail drop algorithm will make a drop/no drop decision. After this point, - * any packets not dropped, will at some point be sent out a TM output - - * assuming that the topology is fully connected and enabled. - * - *

Hierarchical Scheduling and tm_nodes

- * This API supports the ability to do Hierarchical Scheduling whereby the - * final scheduling decision is controlled by equal priority schedulers, - * strict priority multiplexers, bandwidth shapers - at multiple levels - all - * forming a tree rooted at a single egress object. In other words, all - * tm_queues and tm_nodes have the property that their logical "output" feeds - * into one fan-in of a subsequent tm_node or egresss object - forming a proper - * tree. See the following link - - * Example Tm_node - for an example. - * - * Multi-level/hierarchical scheduling adds both great control and significant - * complexity. Logically, despite the implication of the tm_node tree - * diagrams, there are no queues between the levels of hierarchy. Instead all - * packets are held in their input queue, until such time that the totality of - * all of the tm_nodes in the single path from input queue to output object - * agrees that this packet should be the next to be chosen to leave the TM - * system through the output object "portal". Hence what flows from level to - * level is the "local choice" of what packet/tm_queue should next be - * serviced. - * - *

tm_nodes

- * Tm_nodes are the main "entity"/object that a TM system is composed of. - * Each tm_node is a mini-TM subsystem of its own, but the interconnection - * and interplay of a multi-level "tree" of tm_nodes can allow the user - * to specify some very sophisticated behaviours. - * Each tm_node can contain a set of scheduler (one per strict priority level), - * a strict priority multiplexer, a bandwidth shaper and a WRED component - or - * a subset of these. - * - * In its full generality an tm_node consists of a set of "fan-in" connections - * to preceding tm_queues or tm_nodes. The fan-in for a single tm_node - * can range from 1 to many many thousands. This fan-in is divided first - * into a WFQ scheduler per priority level. So if 4 priority levels are - * implemented by this tm_node, there would be 4 WFQ schedulers - each with - * its own unique fan-in. After the WFQ schedulers a priority chooser comes - * next - where it will always choose the highest priority WFQ output - * available. The output of the priority chooser then feeds a bandwidth - * shaper function which then finally uses the shaper's propagation table - * to determine its output packet and its priority. This output could - * then be remapped via a priority map profile and then becomes one of the - * input fan-in to perhaps another level of tm_nodes, and so on. - * - * During this process it is important to remember that the bandwidth shaping - * function never causes packets to be dropped. Instead all packet drops - * occur because of tm_queue fullness or be running the WRED algorithm - * at the time a new packet attempts to be appended to the end of some - * input queue. - * - * The WRED profile associated with an tm_node considers the entire set of - * tm_queues feeding directly or indirectly into it as its measure of - * queue fullness. - * - *

tm_queues

- * tm_queues are the second major type of "entity"/object that a TM - * system is composed of. All packets MUST first enter the TM system via - * some tm_queue. Then logically, the head packets of all of the tm_queues - * are examined simultaneously by the entire TM system, and ONE tm_queue is - * chosen send its head packet out of the TM system's egress. Abstractly - * packets stay in the tm_queue until they are chosen at which time they are - * instantly transferred from tm_queue to/through the corresponding TM egress. - * It is also important to note that packets in the same tm_queue MUST always - * stay in order. In other words, the second packet in an tm_queue must never - * leave the TM system through a TM egress spigot before the first packet has - * left the system. So tm_queue packet order must always be maintained. - * - *

TM egress

- * Note that TM egress objects are NOT referred to as queues, because in - * many/most cases they don't have multi-packet structure but instead are - * viewed as a port/spigot through which the TM system schedules and finally - * transfers input packets through. - * - *

Ideal versus Actual Behavior

- * It is important to recognize the difference between the "abstract" - * mathematical model of the prescribed behavior and real implementations. - * The model describes the Ideal, but theoretically desired behavior, but such - * an Ideal is generally not practical to implement. Instead, one understands - * that virtually all Real TM systems attempt to approximate the Ideal behavior - * as given by the TM configuration as best as they can - while still - * attaining high packet processing performance. The idea is that instead of - * trying too hard to be "perfect" at the granularity of say microseconds, it - * may be better to instead try to match the long term Ideal behavior over a - * much more reasonable period of time like a millisecond. It is generally - * better to have a stable implementation that when averaged over a period of - * several milliseconds matches the Ideal behavior very closely than to have - * an implementation that is perhaps more accurate over a period of - * microseconds, but whose millisecond averaged behavior drifts away from the - * Ideal case. - * - *

Other TM Concepts

- * - *

Profiles

- * This specification often packages related TM system parameters into - * records/objects called profiles. These profiles can then be associated with - * various entities like tm_nodes and tm_queue's. This way the amount of - * storage associated with setting related parameters can be reduced and - * in addition it is common to re-use the same set of parameter set over - * and over again, and also to be able to change the parameter set once - * and have it affect lots of entities with which it is associated with/applied - * to. - * - *

Absolute Limits versus odp_tm_capability_t

- * This header file defines some constants representing the absolute maximum - * settings for any TM system, though in most cases a TM system can (and - * should) be created/instantiated with smaller values, since lower values - * will often result in faster operation and/or less memory used. */ /**