diff mbox series

[net-next,v2,2/4] sch_htb: Hierarchical QoS hardware offload

Message ID 20201211152649.12123-3-maximmi@mellanox.com
State Superseded
Headers show
Series HTB offload | expand

Commit Message

Maxim Mikityanskiy Dec. 11, 2020, 3:26 p.m. UTC
HTB doesn't scale well because of contention on a single lock, and it
also consumes CPU. This patch adds support for offloading HTB to
hardware that supports hierarchical rate limiting.

This solution addresses two main problems of scaling HTB:

1. Contention by flow classification. Currently the filters are attached
to the HTB instance as follows:

    # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
    classid 1:10

It's possible to move classification to clsact egress hook, which is
thread-safe and lock-free:

    # tc filter add dev eth0 egress protocol ip flower dst_port 80
    action skbedit priority 1:10

This way classification still happens in software, but the lock
contention is eliminated, and it happens before selecting the TX queue,
allowing the driver to translate the class to the corresponding hardware
queue.

Note that this is already compatible with non-offloaded HTB and doesn't
require changes to the kernel nor iproute2.

2. Contention by handling packets. HTB is not multi-queue, it attaches
to a whole net device, and handling of all packets takes the same lock.
When HTB is offloaded, its algorithm is done in hardware. HTB registers
itself as a multi-queue qdisc, similarly to mq: HTB is attached to the
netdev, and each queue has its own qdisc. The control flow is still done
by HTB: it calls the driver via ndo_setup_tc to replicate the hierarchy
of classes in the NIC. Leaf classes are presented by hardware queues.
The data path works as follows: a packet is classified by clsact, the
driver selects a hardware queue according to its class, and the packet
is enqueued into this queue's qdisc.

Some features of HTB may be not supported by some particular hardware,
for example, the maximum number of classes may be limited, the
granularity of rate and ceil parameters may be different, etc. - so, the
offload is not enabled by default, a new parameter is used to enable it:

    # tc qdisc replace dev eth0 root handle 1: htb offload

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/linux/netdevice.h            |   1 +
 include/net/pkt_cls.h                |  33 ++
 include/uapi/linux/pkt_sched.h       |   1 +
 net/sched/sch_htb.c                  | 479 +++++++++++++++++++++++++--
 tools/include/uapi/linux/pkt_sched.h |   1 +
 5 files changed, 487 insertions(+), 28 deletions(-)

Comments

Cong Wang Dec. 11, 2020, 7:16 p.m. UTC | #1
On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> HTB doesn't scale well because of contention on a single lock, and it
> also consumes CPU. This patch adds support for offloading HTB to
> hardware that supports hierarchical rate limiting.
>
> This solution addresses two main problems of scaling HTB:
>
> 1. Contention by flow classification. Currently the filters are attached
> to the HTB instance as follows:

I do not think this is the reason, tcf_classify() has been called with RCU
only on the ingress side for a rather long time. What contentions are you
talking about here?

>
>     # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
>     classid 1:10
>
> It's possible to move classification to clsact egress hook, which is
> thread-safe and lock-free:
>
>     # tc filter add dev eth0 egress protocol ip flower dst_port 80
>     action skbedit priority 1:10
>
> This way classification still happens in software, but the lock
> contention is eliminated, and it happens before selecting the TX queue,
> allowing the driver to translate the class to the corresponding hardware
> queue.

Sure, you can use clsact with HTB, or any combinations you like, but you
can't assume your HTB only works with clsact, can you?


>
> Note that this is already compatible with non-offloaded HTB and doesn't
> require changes to the kernel nor iproute2.
>
> 2. Contention by handling packets. HTB is not multi-queue, it attaches
> to a whole net device, and handling of all packets takes the same lock.
> When HTB is offloaded, its algorithm is done in hardware. HTB registers
> itself as a multi-queue qdisc, similarly to mq: HTB is attached to the
> netdev, and each queue has its own qdisc. The control flow is still done
> by HTB: it calls the driver via ndo_setup_tc to replicate the hierarchy
> of classes in the NIC. Leaf classes are presented by hardware queues.
> The data path works as follows: a packet is classified by clsact, the
> driver selects a hardware queue according to its class, and the packet
> is enqueued into this queue's qdisc.

I do _not_ read your code, from what you describe here, it sounds like
you just want a per-queue rate limit, instead of a global one. So why
bothering HTB whose goal is a global rate limit?

And doesn't TBF already work with mq? I mean you can attach it as
a leaf to each mq so that the tree lock will not be shared either, but you'd
lose the benefits of a global rate limit too. EDT does basically the same,
but it never claims to completely replace HTB. ;)

Thanks.
Maxim Mikityanskiy Dec. 14, 2020, 3:12 p.m. UTC | #2
On 2020-12-11 21:16, Cong Wang wrote:
> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:

>>

>> HTB doesn't scale well because of contention on a single lock, and it

>> also consumes CPU. This patch adds support for offloading HTB to

>> hardware that supports hierarchical rate limiting.

>>

>> This solution addresses two main problems of scaling HTB:

>>

>> 1. Contention by flow classification. Currently the filters are attached

>> to the HTB instance as follows:

> 

> I do not think this is the reason, tcf_classify() has been called with RCU

> only on the ingress side for a rather long time. What contentions are you

> talking about here?


When one attaches filters to HTB, tcf_classify is called from 
htb_classify, which is called from htb_enqueue, which is called with the 
root spinlock of the qdisc taken.

>>

>>      # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80

>>      classid 1:10

>>

>> It's possible to move classification to clsact egress hook, which is

>> thread-safe and lock-free:

>>

>>      # tc filter add dev eth0 egress protocol ip flower dst_port 80

>>      action skbedit priority 1:10

>>

>> This way classification still happens in software, but the lock

>> contention is eliminated, and it happens before selecting the TX queue,

>> allowing the driver to translate the class to the corresponding hardware

>> queue.

> 

> Sure, you can use clsact with HTB, or any combinations you like, but you

> can't assume your HTB only works with clsact, can you?


The goal is to eliminate the root lock from the datapath, and the 
traditional filters attached to the HTB itself are handled under that 
lock. I believe it's a sane limitation, given that the offloaded mode is 
a new mode of operation, it's opt-in, and it may also have additional 
hardware-imposed limitations.

> 

>>

>> Note that this is already compatible with non-offloaded HTB and doesn't

>> require changes to the kernel nor iproute2.

>>

>> 2. Contention by handling packets. HTB is not multi-queue, it attaches

>> to a whole net device, and handling of all packets takes the same lock.

>> When HTB is offloaded, its algorithm is done in hardware. HTB registers

>> itself as a multi-queue qdisc, similarly to mq: HTB is attached to the

>> netdev, and each queue has its own qdisc. The control flow is still done

>> by HTB: it calls the driver via ndo_setup_tc to replicate the hierarchy

>> of classes in the NIC. Leaf classes are presented by hardware queues.

>> The data path works as follows: a packet is classified by clsact, the

>> driver selects a hardware queue according to its class, and the packet

>> is enqueued into this queue's qdisc.

> 

> I do _not_ read your code, from what you describe here, it sounds like

> you just want a per-queue rate limit, instead of a global one. So why

> bothering HTB whose goal is a global rate limit?


I would disagree. HTB's goal is hierarchical rate limits with borrowing. 
Sure, it can be used just to set a global limit, but it's main purpose 
is creating a hierarchy of classes.

And yes, we are talking about the whole netdevice here, not about 
per-queue limits (we already support per-queue rate limits with the 
means of tx_maxrate, so we wouldn't need any new code for that). The 
tree of classes is global for the whole netdevice, with hierarchy and 
borrowing supported. These additional send queues can be considered as 
hardware objects that represent offloaded leaf traffic classes (which 
can be extended to multiple queues per class).

So, we are really offloading HTB functionality here, not just using HTB 
interface for something else (something simpler). I hope it sounds 
better for you now.

> And doesn't TBF already work with mq? I mean you can attach it as

> a leaf to each mq so that the tree lock will not be shared either, but you'd

> lose the benefits of a global rate limit too.


Yes, I'd lose not only the global rate limit, but also multi-level 
hierarchical limits, which are all provided by this HTB offload - that's 
why TBF is not really a replacement for this feature.

> EDT does basically the same,

> but it never claims to completely replace HTB. ;)

> 

> Thanks.

>
Cong Wang Dec. 14, 2020, 7:35 p.m. UTC | #3
On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>

> On 2020-12-11 21:16, Cong Wang wrote:

> > On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:

> >>

> >> HTB doesn't scale well because of contention on a single lock, and it

> >> also consumes CPU. This patch adds support for offloading HTB to

> >> hardware that supports hierarchical rate limiting.

> >>

> >> This solution addresses two main problems of scaling HTB:

> >>

> >> 1. Contention by flow classification. Currently the filters are attached

> >> to the HTB instance as follows:

> >

> > I do not think this is the reason, tcf_classify() has been called with RCU

> > only on the ingress side for a rather long time. What contentions are you

> > talking about here?

>

> When one attaches filters to HTB, tcf_classify is called from

> htb_classify, which is called from htb_enqueue, which is called with the

> root spinlock of the qdisc taken.


So it has nothing to do with tcf_classify() itself... :-/

[...]

> > And doesn't TBF already work with mq? I mean you can attach it as

> > a leaf to each mq so that the tree lock will not be shared either, but you'd

> > lose the benefits of a global rate limit too.

>

> Yes, I'd lose not only the global rate limit, but also multi-level

> hierarchical limits, which are all provided by this HTB offload - that's

> why TBF is not really a replacement for this feature.


Interesting, please explain how your HTB offload still has a global rate
limit and borrowing across queues? I simply can't see it, all I can see
is you offload HTB into each queue in ->attach(), where I assume the
hardware will do rate limit on each queue, if the hardware also has a
global control, why it is not reflected on the root qdisc?

Thanks!
Maxim Mikityanskiy Dec. 14, 2020, 8:30 p.m. UTC | #4
On 2020-12-14 21:35, Cong Wang wrote:
> On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

>>

>> On 2020-12-11 21:16, Cong Wang wrote:

>>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:

>>>>

>>>> HTB doesn't scale well because of contention on a single lock, and it

>>>> also consumes CPU. This patch adds support for offloading HTB to

>>>> hardware that supports hierarchical rate limiting.

>>>>

>>>> This solution addresses two main problems of scaling HTB:

>>>>

>>>> 1. Contention by flow classification. Currently the filters are attached

>>>> to the HTB instance as follows:

>>>

>>> I do not think this is the reason, tcf_classify() has been called with RCU

>>> only on the ingress side for a rather long time. What contentions are you

>>> talking about here?

>>

>> When one attaches filters to HTB, tcf_classify is called from

>> htb_classify, which is called from htb_enqueue, which is called with the

>> root spinlock of the qdisc taken.

> 

> So it has nothing to do with tcf_classify() itself... :-/

> 

> [...]

> 

>>> And doesn't TBF already work with mq? I mean you can attach it as

>>> a leaf to each mq so that the tree lock will not be shared either, but you'd

>>> lose the benefits of a global rate limit too.

>>

>> Yes, I'd lose not only the global rate limit, but also multi-level

>> hierarchical limits, which are all provided by this HTB offload - that's

>> why TBF is not really a replacement for this feature.

> 

> Interesting, please explain how your HTB offload still has a global rate

> limit and borrowing across queues?


Sure, I will explain that.

> I simply can't see it, all I can see

> is you offload HTB into each queue in ->attach(),


In the non-offload mode, the same HTB instance would be attached to all 
queues. In the offload mode, HTB behaves like MQ: there is a root 
instance of HTB, but each queue gets a separate simple qdisc (pfifo). 
Only the root qdisc (HTB) gets offloaded, and when that happens, the NIC 
creates an object for the QoS root.

Then all configuration changes are sent to the driver, and it issues the 
corresponding firmware commands to replicate the whole hierarchy in the 
NIC. Leaf classes correspond to queue groups (in this implementation 
queue groups contain only one queue, but it can be extended), and inner 
classes correspond to entities called TSARs.

The information about rate limits is stored inside TSARs and queue 
groups. Queues know what groups they belong to, and groups and TSARs 
know what TSAR is their parent. A queue is picked in ndo_select_queue by 
looking at the classification result of clsact. So, when a packet is put 
onto a queue, the NIC can track the whole hierarchy and do the HTB 
algorithm.

> where I assume the

> hardware will do rate limit on each queue, 


So, it's not flat in the NIC, and rate limiting is done in a 
hierarchical way.

> if the hardware also has a

> global control, why it is not reflected on the root qdisc?


I'm not sure if I got this last question correctly. The root qdisc is 
HTB, and all the configuration of the HTB tree gets reflected in the 
NIC, as I just explained. I hope now it's clearer, but if you still have 
questions, I'm glad to explain more details (also, I'm ready to respin 
with the minor fixes for the CI build issue on parisc).

Thanks,
Max

> Thanks!

>
Jamal Hadi Salim Dec. 15, 2020, 4:37 p.m. UTC | #5
On 2020-12-14 3:30 p.m., Maxim Mikityanskiy wrote:
> On 2020-12-14 21:35, Cong Wang wrote:

>> On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy 

>> <maximmi@nvidia.com> wrote:

>>>

>>> On 2020-12-11 21:16, Cong Wang wrote:

>>>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy 

>>>> <maximmi@mellanox.com> wrote:

>>>>>



>>

>> Interesting, please explain how your HTB offload still has a global rate

>> limit and borrowing across queues?

> 

> Sure, I will explain that.

> 

>> I simply can't see it, all I can see

>> is you offload HTB into each queue in ->attach(),

> 

> In the non-offload mode, the same HTB instance would be attached to all 

> queues. In the offload mode, HTB behaves like MQ: there is a root 

> instance of HTB, but each queue gets a separate simple qdisc (pfifo). 

> Only the root qdisc (HTB) gets offloaded, and when that happens, the NIC 

> creates an object for the QoS root.

> 

> Then all configuration changes are sent to the driver, and it issues the 

> corresponding firmware commands to replicate the whole hierarchy in the 

> NIC. Leaf classes correspond to queue groups (in this implementation 

> queue groups contain only one queue, but it can be extended),



FWIW, it is very valuable to be able to abstract HTB if the hardware
can emulate it (users dont have to learn about new abstracts).
Since you are expressing a limitation above:
How does the user discover if they over-provisioned i.e single
queue example above? If there are too many corner cases it may
make sense to just create a new qdisc.

> and inner 

> classes correspond to entities called TSARs.

> 

> The information about rate limits is stored inside TSARs and queue 

> groups. Queues know what groups they belong to, and groups and TSARs 

> know what TSAR is their parent. A queue is picked in ndo_select_queue by 

> looking at the classification result of clsact. So, when a packet is put 

> onto a queue, the NIC can track the whole hierarchy and do the HTB 

> algorithm.

> 


Same question above:
Is there a limit to the number of classes that can be created?
IOW, if someone just created an arbitrary number of queues do they
get errored-out if it doesnt make sense for the hardware?
If such limits exist, it may make sense to provide a knob to query
(maybe ethtool) and if such limits can be adjusted it may be worth
looking at providing interfaces via devlink.

cheers,
jamal


cheers,
jamal
Maxim Mikityanskiy Dec. 16, 2020, 11:47 a.m. UTC | #6
On 2020-12-15 18:37, Jamal Hadi Salim wrote:
> On 2020-12-14 3:30 p.m., Maxim Mikityanskiy wrote:

>> On 2020-12-14 21:35, Cong Wang wrote:

>>> On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy 

>>> <maximmi@nvidia.com> wrote:

>>>>

>>>> On 2020-12-11 21:16, Cong Wang wrote:

>>>>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy 

>>>>> <maximmi@mellanox.com> wrote:

>>>>>>

> 

> 

>>>

>>> Interesting, please explain how your HTB offload still has a global rate

>>> limit and borrowing across queues?

>>

>> Sure, I will explain that.

>>

>>> I simply can't see it, all I can see

>>> is you offload HTB into each queue in ->attach(),

>>

>> In the non-offload mode, the same HTB instance would be attached to 

>> all queues. In the offload mode, HTB behaves like MQ: there is a root 

>> instance of HTB, but each queue gets a separate simple qdisc (pfifo). 

>> Only the root qdisc (HTB) gets offloaded, and when that happens, the 

>> NIC creates an object for the QoS root.

>>

>> Then all configuration changes are sent to the driver, and it issues 

>> the corresponding firmware commands to replicate the whole hierarchy 

>> in the NIC. Leaf classes correspond to queue groups (in this 

>> implementation queue groups contain only one queue, but it can be 

>> extended),

> 

> 

> FWIW, it is very valuable to be able to abstract HTB if the hardware

> can emulate it (users dont have to learn about new abstracts).


Yes, that's the reason for using an existing interface (HTB) to 
configure the feature.

> Since you are expressing a limitation above:

> How does the user discover if they over-provisioned i.e single

> queue example above?


It comes to the CPU usage. If the core that serves the queue is busy 
with sending packets 100% of time, you need more queues. Also, if the 
user runs more than one application belonging to the same class, and 
pins them to different cores, it makes sense to create more than one queue.

I'd like to emphasize that this is not a hard limitation. Our hardware 
and firmware supports multiple queues per class. What's needed is the 
support from the driver side and probably an additional parameter to tc 
class add to specify the number of queues to reserve.

> If there are too many corner cases it may

> make sense to just create a new qdisc.

> 

>> and inner classes correspond to entities called TSARs.

>>

>> The information about rate limits is stored inside TSARs and queue 

>> groups. Queues know what groups they belong to, and groups and TSARs 

>> know what TSAR is their parent. A queue is picked in ndo_select_queue 

>> by looking at the classification result of clsact. So, when a packet 

>> is put onto a queue, the NIC can track the whole hierarchy and do the 

>> HTB algorithm.

>>

> 

> Same question above:

> Is there a limit to the number of classes that can be created?


Yes, the commit message of the mlx5 patch lists the limitations of our 
NICs. Basically, it's 256 leaf classes and 3 levels of hierarchy.

> IOW, if someone just created an arbitrary number of queues do they

> get errored-out if it doesnt make sense for the hardware?


The current implementation starts failing gracefully if the limits are 
exceeded. The tc command won't succeed, and everything will roll back to 
the stable state, which was just before the tc command.

> If such limits exist, it may make sense to provide a knob to query

> (maybe ethtool)


Sounds legit, but I'm not sure what would be the best interface for 
that. Ethtool is not involved at all in this implementation, and AFAIK 
it doesn't contain any existing command for similar stuff. We could hook 
into set-channels and add new type of channels for HTB, but the 
semantics isn't very clear, because HTB queues != HTB leaf classes, and 
I don't know if it's allowed to extend this interface (if so, I have 
more thoughts of extending it for other purposes).

> and if such limits can be adjusted it may be worth

> looking at providing interfaces via devlink.


Not really. At the moment, there isn't a good reason to decrease the 
maximum limits. It would make sense if it could free up some resources 
for something else, but AFAIK it's not the case now.

Thanks,
Max

> cheers,

> jamal

> 

> 

> cheers,

> jamal
Cong Wang Dec. 16, 2020, 7:01 p.m. UTC | #7
On Mon, Dec 14, 2020 at 12:30 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>

> On 2020-12-14 21:35, Cong Wang wrote:

> > On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

> >>

> >> On 2020-12-11 21:16, Cong Wang wrote:

> >>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:

> >>>>

> >>>> HTB doesn't scale well because of contention on a single lock, and it

> >>>> also consumes CPU. This patch adds support for offloading HTB to

> >>>> hardware that supports hierarchical rate limiting.

> >>>>

> >>>> This solution addresses two main problems of scaling HTB:

> >>>>

> >>>> 1. Contention by flow classification. Currently the filters are attached

> >>>> to the HTB instance as follows:

> >>>

> >>> I do not think this is the reason, tcf_classify() has been called with RCU

> >>> only on the ingress side for a rather long time. What contentions are you

> >>> talking about here?

> >>

> >> When one attaches filters to HTB, tcf_classify is called from

> >> htb_classify, which is called from htb_enqueue, which is called with the

> >> root spinlock of the qdisc taken.

> >

> > So it has nothing to do with tcf_classify() itself... :-/

> >

> > [...]

> >

> >>> And doesn't TBF already work with mq? I mean you can attach it as

> >>> a leaf to each mq so that the tree lock will not be shared either, but you'd

> >>> lose the benefits of a global rate limit too.

> >>

> >> Yes, I'd lose not only the global rate limit, but also multi-level

> >> hierarchical limits, which are all provided by this HTB offload - that's

> >> why TBF is not really a replacement for this feature.

> >

> > Interesting, please explain how your HTB offload still has a global rate

> > limit and borrowing across queues?

>

> Sure, I will explain that.

>

> > I simply can't see it, all I can see

> > is you offload HTB into each queue in ->attach(),

>

> In the non-offload mode, the same HTB instance would be attached to all

> queues. In the offload mode, HTB behaves like MQ: there is a root

> instance of HTB, but each queue gets a separate simple qdisc (pfifo).

> Only the root qdisc (HTB) gets offloaded, and when that happens, the NIC

> creates an object for the QoS root.


Please add this to your changelog.

And why is the offloaded root qdisc not visible to software? All you add to
root HTB are pointers of direct qdisc's and a boolean, this is what I meant
by "not reflected". I expect the hardware parameters/stats are exposed to
software too, but I can't find any.

>

> Then all configuration changes are sent to the driver, and it issues the

> corresponding firmware commands to replicate the whole hierarchy in the

> NIC. Leaf classes correspond to queue groups (in this implementation

> queue groups contain only one queue, but it can be extended), and inner

> classes correspond to entities called TSARs.

>

> The information about rate limits is stored inside TSARs and queue

> groups. Queues know what groups they belong to, and groups and TSARs

> know what TSAR is their parent. A queue is picked in ndo_select_queue by

> looking at the classification result of clsact. So, when a packet is put

> onto a queue, the NIC can track the whole hierarchy and do the HTB

> algorithm.


Glad to know hardware still keeps HTB as a hierarchy.

Please also add this either to source code as comments or in your
changelog, it is very important to understand what is done by hardware.

Thanks.
Maxim Mikityanskiy Dec. 17, 2020, 2:24 p.m. UTC | #8
On 2020-12-16 21:01, Cong Wang wrote:
> On Mon, Dec 14, 2020 at 12:30 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

>>

>> On 2020-12-14 21:35, Cong Wang wrote:

>>> On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

>>>>

>>>> On 2020-12-11 21:16, Cong Wang wrote:

>>>>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:

>>>>>>

>>>>>> HTB doesn't scale well because of contention on a single lock, and it

>>>>>> also consumes CPU. This patch adds support for offloading HTB to

>>>>>> hardware that supports hierarchical rate limiting.

>>>>>>

>>>>>> This solution addresses two main problems of scaling HTB:

>>>>>>

>>>>>> 1. Contention by flow classification. Currently the filters are attached

>>>>>> to the HTB instance as follows:

>>>>>

>>>>> I do not think this is the reason, tcf_classify() has been called with RCU

>>>>> only on the ingress side for a rather long time. What contentions are you

>>>>> talking about here?

>>>>

>>>> When one attaches filters to HTB, tcf_classify is called from

>>>> htb_classify, which is called from htb_enqueue, which is called with the

>>>> root spinlock of the qdisc taken.

>>>

>>> So it has nothing to do with tcf_classify() itself... :-/

>>>

>>> [...]

>>>

>>>>> And doesn't TBF already work with mq? I mean you can attach it as

>>>>> a leaf to each mq so that the tree lock will not be shared either, but you'd

>>>>> lose the benefits of a global rate limit too.

>>>>

>>>> Yes, I'd lose not only the global rate limit, but also multi-level

>>>> hierarchical limits, which are all provided by this HTB offload - that's

>>>> why TBF is not really a replacement for this feature.

>>>

>>> Interesting, please explain how your HTB offload still has a global rate

>>> limit and borrowing across queues?

>>

>> Sure, I will explain that.

>>

>>> I simply can't see it, all I can see

>>> is you offload HTB into each queue in ->attach(),

>>

>> In the non-offload mode, the same HTB instance would be attached to all

>> queues. In the offload mode, HTB behaves like MQ: there is a root

>> instance of HTB, but each queue gets a separate simple qdisc (pfifo).

>> Only the root qdisc (HTB) gets offloaded, and when that happens, the NIC

>> creates an object for the QoS root.

> 

> Please add this to your changelog.


The similar information is already in the commit message (the paragraph 
under point 2.), but I can rephrase it and elaborate on this, focusing 
on the interaction between HTB, driver and hardware.

> And why is the offloaded root qdisc not visible to software? All you add to

> root HTB are pointers of direct qdisc's and a boolean, this is what I meant

> by "not reflected". I expect the hardware parameters/stats are exposed to

> software too, but I can't find any.


Hardware parameters are rate and ceil, there are no extra parameters 
that exist only in the offload mode. In the future, we may add the 
number of queues to reserve for a class - that will be configured and 
reflected in tc.

Regarding stats, unfortunately, the hardware doesn't expose any 
low-level stats like numbers of tokens, etc. Basic stats like packets 
and bytes are supported and exposed in tc, but they are calculated in 
software (by per-queue qdiscs) - similarly to how we calculate 
per-TX-queue packets and bytes in software.

> 

>>

>> Then all configuration changes are sent to the driver, and it issues the

>> corresponding firmware commands to replicate the whole hierarchy in the

>> NIC. Leaf classes correspond to queue groups (in this implementation

>> queue groups contain only one queue, but it can be extended), and inner

>> classes correspond to entities called TSARs.

>>

>> The information about rate limits is stored inside TSARs and queue

>> groups. Queues know what groups they belong to, and groups and TSARs

>> know what TSAR is their parent. A queue is picked in ndo_select_queue by

>> looking at the classification result of clsact. So, when a packet is put

>> onto a queue, the NIC can track the whole hierarchy and do the HTB

>> algorithm.

> 

> Glad to know hardware still keeps HTB as a hierarchy.

> 

> Please also add this either to source code as comments or in your

> changelog, it is very important to understand what is done by hardware.


OK, as I said above in this letter, I can rephrase the commit message, 
focusing on details about interaction between HTB <-> driver <-> NIC and 
what happens in the NIC. It should look better if I put it in a 
dedicated paragraph, instead of mentioning it under the contention problem.

Thanks,
Max

> Thanks.

>
Jamal Hadi Salim Dec. 17, 2020, 3:09 p.m. UTC | #9
On 2020-12-16 6:47 a.m., Maxim Mikityanskiy wrote:
> On 2020-12-15 18:37, Jamal Hadi Salim wrote:


[..]

>>

>> Same question above:

>> Is there a limit to the number of classes that can be created?

> 

> Yes, the commit message of the mlx5 patch lists the limitations of our 

> NICs. Basically, it's 256 leaf classes and 3 levels of hierarchy.

>


Ok, thats what i was looking for.


>> IOW, if someone just created an arbitrary number of queues do they

>> get errored-out if it doesnt make sense for the hardware?

> 

> The current implementation starts failing gracefully if the limits are 

> exceeded. The tc command won't succeed, and everything will roll back to 

> the stable state, which was just before the tc command.

>


Does the user gets notified somehow or it fails silently?
An extack message would help.


>> If such limits exist, it may make sense to provide a knob to query

>> (maybe ethtool)

> 

> Sounds legit, but I'm not sure what would be the best interface for 

> that. Ethtool is not involved at all in this implementation, and AFAIK 

> it doesn't contain any existing command for similar stuff. We could hook 

> into set-channels and add new type of channels for HTB, but the 

> semantics isn't very clear, because HTB queues != HTB leaf classes, and 

> I don't know if it's allowed to extend this interface (if so, I have 

> more thoughts of extending it for other purposes).

> 


More looking to make sure no suprise to the user. Either the user can
discover what the constraints are or when they provision they get a
a message like "cannot offload more than 3 hierarchies" or "use devlink
if you want to use more than 256 classes", etc.

cheers,
jamal
Maxim Mikityanskiy Dec. 18, 2020, 10:48 a.m. UTC | #10
On 2020-12-17 17:09, Jamal Hadi Salim wrote:
> On 2020-12-16 6:47 a.m., Maxim Mikityanskiy wrote:

>> On 2020-12-15 18:37, Jamal Hadi Salim wrote:

> 

> [..]

> 

>>>

>>> Same question above:

>>> Is there a limit to the number of classes that can be created?

>>

>> Yes, the commit message of the mlx5 patch lists the limitations of our 

>> NICs. Basically, it's 256 leaf classes and 3 levels of hierarchy.

>>

> 

> Ok, thats what i was looking for.

> 

> 

>>> IOW, if someone just created an arbitrary number of queues do they

>>> get errored-out if it doesnt make sense for the hardware?

>>

>> The current implementation starts failing gracefully if the limits are 

>> exceeded. The tc command won't succeed, and everything will roll back 

>> to the stable state, which was just before the tc command.

>>

> 

> Does the user gets notified somehow or it fails silently?

> An extack message would help.

DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1;
	t=1608288498; bh=RMhOdIb4utXHB89dvxKO9kM+jQNSV+kIKsJ0O6KNJWs=;
	h=Subject:To:CC:References:From:Message-ID:Date:User-Agent:
	 MIME-Version:In-Reply-To:Content-Type:Content-Language:
	 Content-Transfer-Encoding:X-Originating-IP:X-ClientProxiedBy;
	b=pThe8Bqxykn4jFcf5ZYZMSgEPZZBiId+lReLZZ9KFCdbROoDwWPTGDWFNhOADAvRA
	 7bjihDOFAvu2+MXP1m9/IS0lQgKXa2iOKAk5S1QwZj+aF5m9bZ1ya7uRSZiK7k32Tp
	 cPyfic9k+SXTN0zyJovtWc5i4QlwnHWCQ7zPvMYwB4Kmvmk/YiRwIQvEooMDQaq92Z
	 7+scnlRaNVAEd8ZERye9Fxa7htaGiUArPQR3aS7ol1QmrCXaN1hZA84lySCaLcJ6JA
	 6qFKvjj0XWP0TsYJqpZzkS+F0yfGOAphtaIMWq/o8m5Julk0e90Yj6maRvGG3kFOaY
	 c4zEilz1uAHKA==

The current implementation doesn't use extack, just returns an error 
code, because many callbacks to the qdisc don't get extack as a 
parameter. However, I agree with you, these messages would be helpful 
for the user when an operation fails due to hardware limitations - it 
will be easier than guessing what caused a EINVAL, so I'll add them. I 
will review which callbacks lacked an extack, and I might add it if it's 
meaningful for them.

> 

>>> If such limits exist, it may make sense to provide a knob to query

>>> (maybe ethtool)

>>

>> Sounds legit, but I'm not sure what would be the best interface for 

>> that. Ethtool is not involved at all in this implementation, and AFAIK 

>> it doesn't contain any existing command for similar stuff. We could 

>> hook into set-channels and add new type of channels for HTB, but the 

>> semantics isn't very clear, because HTB queues != HTB leaf classes, 

>> and I don't know if it's allowed to extend this interface (if so, I 

>> have more thoughts of extending it for other purposes).

>>

> 

> More looking to make sure no suprise to the user. Either the user can

> discover what the constraints are or when they provision they get a

> a message like "cannot offload more than 3 hierarchies" or "use devlink

> if you want to use more than 256 classes", etc.


Yes, it makes perfect sense. Messages are even more user-friendly, as 
for me. So, I'll add such messages to extack, and as the limitations are 
driver-specific, I'll pass extack to the driver.

I will respin when net-next reopens, in the meanwhile comments are welcome.

Thanks,
Max

> 

> cheers,

> jamal
diff mbox series

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7bf167993c05..6830a8d2dbe9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -859,6 +859,7 @@  enum tc_setup_type {
 	TC_SETUP_QDISC_ETS,
 	TC_SETUP_QDISC_TBF,
 	TC_SETUP_QDISC_FIFO,
+	TC_SETUP_QDISC_HTB,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 0f2a9c44171c..73ec1639365b 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -783,6 +783,39 @@  struct tc_mq_qopt_offload {
 	};
 };
 
+enum tc_htb_command {
+	/* Root */
+	TC_HTB_CREATE, /* Initialize HTB offload. */
+	TC_HTB_DESTROY, /* Destroy HTB offload. */
+
+	/* Classes */
+	/* Allocate qid and create leaf. */
+	TC_HTB_LEAF_ALLOC_QUEUE,
+	/* Convert leaf to inner, preserve and return qid, create new leaf. */
+	TC_HTB_LEAF_TO_INNER,
+	/* Delete leaf, while siblings remain. */
+	TC_HTB_LEAF_DEL,
+	/* Delete leaf, convert parent to leaf, preserving qid. */
+	TC_HTB_LEAF_DEL_LAST,
+	/* Modify parameters of a node. */
+	TC_HTB_NODE_MODIFY,
+
+	/* Class qdisc */
+	TC_HTB_LEAF_QUERY_QUEUE, /* Query qid by classid. */
+};
+
+struct tc_htb_qopt_offload {
+	enum tc_htb_command command;
+	u16 classid;
+	u32 parent_classid;
+	u16 qid;
+	u16 moved_qid;
+	u64 rate;
+	u64 ceil;
+};
+
+#define TC_HTB_CLASSID_ROOT U32_MAX
+
 enum tc_red_command {
 	TC_RED_REPLACE,
 	TC_RED_DESTROY,
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 9e7c2c607845..79a699f106b1 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -434,6 +434,7 @@  enum {
 	TCA_HTB_RATE64,
 	TCA_HTB_CEIL64,
 	TCA_HTB_PAD,
+	TCA_HTB_OFFLOAD,
 	__TCA_HTB_MAX,
 };
 
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index cd70dbcbd72f..fccdce591104 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -174,6 +174,11 @@  struct htb_sched {
 	int			row_mask[TC_HTB_MAXDEPTH];
 
 	struct htb_level	hlevel[TC_HTB_MAXDEPTH];
+
+	struct Qdisc		**direct_qdiscs;
+	unsigned int            num_direct_qdiscs;
+
+	bool			offload;
 };
 
 /* find class in global hash table using given handle */
@@ -957,7 +962,7 @@  static void htb_reset(struct Qdisc *sch)
 			if (cl->level)
 				memset(&cl->inner, 0, sizeof(cl->inner));
 			else {
-				if (cl->leaf.q)
+				if (cl->leaf.q && !q->offload)
 					qdisc_reset(cl->leaf.q);
 			}
 			cl->prio_activity = 0;
@@ -980,6 +985,7 @@  static const struct nla_policy htb_policy[TCA_HTB_MAX + 1] = {
 	[TCA_HTB_DIRECT_QLEN] = { .type = NLA_U32 },
 	[TCA_HTB_RATE64] = { .type = NLA_U64 },
 	[TCA_HTB_CEIL64] = { .type = NLA_U64 },
+	[TCA_HTB_OFFLOAD] = { .type = NLA_FLAG },
 };
 
 static void htb_work_func(struct work_struct *work)
@@ -992,12 +998,27 @@  static void htb_work_func(struct work_struct *work)
 	rcu_read_unlock();
 }
 
+static void htb_set_lockdep_class_child(struct Qdisc *q)
+{
+	static struct lock_class_key child_key;
+
+	lockdep_set_class(qdisc_lock(q), &child_key);
+}
+
+static int htb_offload(struct net_device *dev, struct tc_htb_qopt_offload *opt)
+{
+	return dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_HTB, opt);
+}
+
 static int htb_init(struct Qdisc *sch, struct nlattr *opt,
 		    struct netlink_ext_ack *extack)
 {
+	struct net_device *dev = qdisc_dev(sch);
+	struct tc_htb_qopt_offload offload_opt;
 	struct htb_sched *q = qdisc_priv(sch);
 	struct nlattr *tb[TCA_HTB_MAX + 1];
 	struct tc_htb_glob *gopt;
+	unsigned int ntx;
 	int err;
 
 	qdisc_watchdog_init(&q->watchdog, sch);
@@ -1022,9 +1043,23 @@  static int htb_init(struct Qdisc *sch, struct nlattr *opt,
 	if (gopt->version != HTB_VER >> 16)
 		return -EINVAL;
 
+	q->offload = nla_get_flag(tb[TCA_HTB_OFFLOAD]);
+
+	if (q->offload) {
+		if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
+			return -EOPNOTSUPP;
+
+		q->num_direct_qdiscs = dev->real_num_tx_queues;
+		q->direct_qdiscs = kcalloc(q->num_direct_qdiscs,
+					   sizeof(*q->direct_qdiscs),
+					   GFP_KERNEL);
+		if (!q->direct_qdiscs)
+			return -ENOMEM;
+	}
+
 	err = qdisc_class_hash_init(&q->clhash);
 	if (err < 0)
-		return err;
+		goto err_free_direct_qdiscs;
 
 	qdisc_skb_head_init(&q->direct_queue);
 
@@ -1037,7 +1072,106 @@  static int htb_init(struct Qdisc *sch, struct nlattr *opt,
 		q->rate2quantum = 1;
 	q->defcls = gopt->defcls;
 
+	if (!q->offload)
+		return 0;
+
+	for (ntx = 0; ntx < q->num_direct_qdiscs; ntx++) {
+		struct netdev_queue *dev_queue = netdev_get_tx_queue(dev, ntx);
+		struct Qdisc *qdisc;
+
+		qdisc = qdisc_create_dflt(dev_queue, &pfifo_qdisc_ops,
+					  TC_H_MAKE(sch->handle, 0), extack);
+		if (!qdisc) {
+			err = -ENOMEM;
+			goto err_free_qdiscs;
+		}
+
+		htb_set_lockdep_class_child(qdisc);
+		q->direct_qdiscs[ntx] = qdisc;
+		qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
+	}
+
+	sch->flags |= TCQ_F_MQROOT;
+
+	offload_opt = (struct tc_htb_qopt_offload) {
+		.command = TC_HTB_CREATE,
+		.parent_classid = TC_H_MAJ(sch->handle) >> 16,
+		.classid = TC_H_MIN(q->defcls),
+	};
+	err = htb_offload(dev, &offload_opt);
+	if (err)
+		goto err_free_qdiscs;
+
 	return 0;
+
+err_free_qdiscs:
+	/* TC_HTB_CREATE call failed, avoid any further calls to the driver. */
+	q->offload = false;
+
+	for (ntx = 0; ntx < q->num_direct_qdiscs && q->direct_qdiscs[ntx];
+	     ntx++)
+		qdisc_put(q->direct_qdiscs[ntx]);
+
+	qdisc_class_hash_destroy(&q->clhash);
+	/* Prevent use-after-free and double-free when htb_destroy gets called.
+	 */
+	q->clhash.hash = NULL;
+	q->clhash.hashsize = 0;
+
+err_free_direct_qdiscs:
+	kfree(q->direct_qdiscs);
+	q->direct_qdiscs = NULL;
+	return err;
+}
+
+static void htb_attach_offload(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct htb_sched *q = qdisc_priv(sch);
+	unsigned int ntx;
+
+	for (ntx = 0; ntx < q->num_direct_qdiscs; ntx++) {
+		struct Qdisc *old, *qdisc = q->direct_qdiscs[ntx];
+
+		old = dev_graft_qdisc(qdisc->dev_queue, qdisc);
+		qdisc_put(old);
+		qdisc_hash_add(qdisc, false);
+	}
+	for (ntx = q->num_direct_qdiscs; ntx < dev->num_tx_queues; ntx++) {
+		struct netdev_queue *dev_queue = netdev_get_tx_queue(dev, ntx);
+		struct Qdisc *old = dev_graft_qdisc(dev_queue, NULL);
+
+		qdisc_put(old);
+	}
+
+	kfree(q->direct_qdiscs);
+	q->direct_qdiscs = NULL;
+}
+
+static void htb_attach_software(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned int ntx;
+
+	/* Resemble qdisc_graft behavior. */
+	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+		struct netdev_queue *dev_queue = netdev_get_tx_queue(dev, ntx);
+		struct Qdisc *old = dev_graft_qdisc(dev_queue, sch);
+
+		qdisc_refcount_inc(sch);
+
+		qdisc_put(old);
+	}
+}
+
+static void htb_attach(struct Qdisc *sch)
+{
+	struct htb_sched *q = qdisc_priv(sch);
+
+	if (q->offload)
+		htb_attach_offload(sch);
+	else
+		htb_attach_software(sch);
 }
 
 static int htb_dump(struct Qdisc *sch, struct sk_buff *skb)
@@ -1046,6 +1180,11 @@  static int htb_dump(struct Qdisc *sch, struct sk_buff *skb)
 	struct nlattr *nest;
 	struct tc_htb_glob gopt;
 
+	if (q->offload)
+		sch->flags |= TCQ_F_OFFLOADED;
+	else
+		sch->flags &= ~TCQ_F_OFFLOADED;
+
 	sch->qstats.overlimits = q->overlimits;
 	/* Its safe to not acquire qdisc lock. As we hold RTNL,
 	 * no change can happen on the qdisc parameters.
@@ -1063,6 +1202,8 @@  static int htb_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put(skb, TCA_HTB_INIT, sizeof(gopt), &gopt) ||
 	    nla_put_u32(skb, TCA_HTB_DIRECT_QLEN, q->direct_qlen))
 		goto nla_put_failure;
+	if (q->offload && nla_put_flag(skb, TCA_HTB_OFFLOAD))
+		goto nla_put_failure;
 
 	return nla_nest_end(skb, nest);
 
@@ -1144,19 +1285,97 @@  htb_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
 	return gnet_stats_copy_app(d, &cl->xstats, sizeof(cl->xstats));
 }
 
+static struct netdev_queue *
+htb_select_queue(struct Qdisc *sch, struct tcmsg *tcm)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct tc_htb_qopt_offload offload_opt;
+	int err;
+
+	offload_opt = (struct tc_htb_qopt_offload) {
+		.command = TC_HTB_LEAF_QUERY_QUEUE,
+		.classid = TC_H_MIN(tcm->tcm_parent),
+	};
+	err = htb_offload(dev, &offload_opt);
+	if (err || offload_opt.qid >= dev->num_tx_queues)
+		return NULL;
+	return netdev_get_tx_queue(dev, offload_opt.qid);
+}
+
+static struct Qdisc *
+htb_graft_helper(struct netdev_queue *dev_queue, struct Qdisc *new_q)
+{
+	struct net_device *dev = dev_queue->dev;
+	struct Qdisc *old_q;
+
+	if (dev->flags & IFF_UP)
+		dev_deactivate(dev);
+	old_q = dev_graft_qdisc(dev_queue, new_q);
+	if (new_q)
+		new_q->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
+	if (dev->flags & IFF_UP)
+		dev_activate(dev);
+
+	return old_q;
+}
+
+static void htb_offload_move_qdisc(struct Qdisc *sch, u16 qid_old, u16 qid_new)
+{
+	struct netdev_queue *queue_old, *queue_new;
+	struct net_device *dev = qdisc_dev(sch);
+	struct Qdisc *qdisc;
+
+	queue_old = netdev_get_tx_queue(dev, qid_old);
+	queue_new = netdev_get_tx_queue(dev, qid_new);
+
+	if (dev->flags & IFF_UP)
+		dev_deactivate(dev);
+	qdisc = dev_graft_qdisc(queue_old, NULL);
+	qdisc->dev_queue = queue_new;
+	qdisc = dev_graft_qdisc(queue_new, qdisc);
+	if (dev->flags & IFF_UP)
+		dev_activate(dev);
+
+	WARN_ON(!(qdisc->flags & TCQ_F_BUILTIN));
+}
+
 static int htb_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
 		     struct Qdisc **old, struct netlink_ext_ack *extack)
 {
+	struct netdev_queue *dev_queue = sch->dev_queue;
 	struct htb_class *cl = (struct htb_class *)arg;
+	struct htb_sched *q = qdisc_priv(sch);
+	struct Qdisc *old_q;
 
 	if (cl->level)
 		return -EINVAL;
-	if (new == NULL &&
-	    (new = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
-				     cl->common.classid, extack)) == NULL)
-		return -ENOBUFS;
+
+	if (q->offload) {
+		dev_queue = new->dev_queue;
+		WARN_ON(dev_queue != cl->leaf.q->dev_queue);
+	}
+
+	if (!new) {
+		new = qdisc_create_dflt(dev_queue, &pfifo_qdisc_ops,
+					cl->common.classid, extack);
+		if (!new)
+			return -ENOBUFS;
+	}
+
+	if (q->offload) {
+		htb_set_lockdep_class_child(new);
+		/* One ref for cl->leaf.q, the other for dev_queue->qdisc. */
+		qdisc_refcount_inc(new);
+		old_q = htb_graft_helper(dev_queue, new);
+	}
 
 	*old = qdisc_replace(sch, new, &cl->leaf.q);
+
+	if (q->offload) {
+		WARN_ON(old_q != *old);
+		qdisc_put(old_q);
+	}
+
 	return 0;
 }
 
@@ -1184,9 +1403,10 @@  static inline int htb_parent_last_child(struct htb_class *cl)
 	return 1;
 }
 
-static void htb_parent_to_leaf(struct htb_sched *q, struct htb_class *cl,
+static void htb_parent_to_leaf(struct Qdisc *sch, struct htb_class *cl,
 			       struct Qdisc *new_q)
 {
+	struct htb_sched *q = qdisc_priv(sch);
 	struct htb_class *parent = cl->parent;
 
 	WARN_ON(cl->level || !cl->leaf.q || cl->prio_activity);
@@ -1204,6 +1424,61 @@  static void htb_parent_to_leaf(struct htb_sched *q, struct htb_class *cl,
 	parent->cmode = HTB_CAN_SEND;
 }
 
+static void htb_parent_to_leaf_offload(struct Qdisc *sch,
+				       struct netdev_queue *dev_queue,
+				       struct Qdisc *new_q)
+{
+	struct Qdisc *old_q;
+
+	/* One ref for cl->leaf.q, the other for dev_queue->qdisc. */
+	qdisc_refcount_inc(new_q);
+	old_q = htb_graft_helper(dev_queue, new_q);
+	WARN_ON(!(old_q->flags & TCQ_F_BUILTIN));
+}
+
+static void htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
+				      bool last_child, bool destroying)
+{
+	struct tc_htb_qopt_offload offload_opt;
+	struct Qdisc *q = cl->leaf.q;
+	struct Qdisc *old = NULL;
+
+	if (cl->level)
+		return;
+
+	WARN_ON(!q);
+	if (!destroying) {
+		/* On destroy of HTB, two cases are possible:
+		 * 1. q is a normal qdisc, but q->dev_queue has noop qdisc.
+		 * 2. q is a noop qdisc (for nodes that were inner),
+		 *    q->dev_queue is noop_netdev_queue.
+		 */
+		old = htb_graft_helper(q->dev_queue, NULL);
+		WARN_ON(!old);
+		WARN_ON(old != q);
+	}
+
+	offload_opt = (struct tc_htb_qopt_offload) {
+		.command = last_child ? TC_HTB_LEAF_DEL_LAST : TC_HTB_LEAF_DEL,
+		.classid = cl->common.classid,
+	};
+	htb_offload(qdisc_dev(sch), &offload_opt);
+
+	qdisc_put(old);
+
+	if (last_child)
+		return;
+
+	if (offload_opt.moved_qid != 0) {
+		if (destroying)
+			q->dev_queue = netdev_get_tx_queue(qdisc_dev(sch),
+							   offload_opt.qid);
+		else
+			htb_offload_move_qdisc(sch, offload_opt.moved_qid,
+					       offload_opt.qid);
+	}
+}
+
 static void htb_destroy_class(struct Qdisc *sch, struct htb_class *cl)
 {
 	if (!cl->level) {
@@ -1217,8 +1492,11 @@  static void htb_destroy_class(struct Qdisc *sch, struct htb_class *cl)
 
 static void htb_destroy(struct Qdisc *sch)
 {
+	struct net_device *dev = qdisc_dev(sch);
+	struct tc_htb_qopt_offload offload_opt;
 	struct htb_sched *q = qdisc_priv(sch);
 	struct hlist_node *next;
+	bool nonempty, changed;
 	struct htb_class *cl;
 	unsigned int i;
 
@@ -1237,13 +1515,58 @@  static void htb_destroy(struct Qdisc *sch)
 			cl->block = NULL;
 		}
 	}
-	for (i = 0; i < q->clhash.hashsize; i++) {
-		hlist_for_each_entry_safe(cl, next, &q->clhash.hash[i],
-					  common.hnode)
-			htb_destroy_class(sch, cl);
-	}
+
+	do {
+		nonempty = false;
+		changed = false;
+		for (i = 0; i < q->clhash.hashsize; i++) {
+			hlist_for_each_entry_safe(cl, next, &q->clhash.hash[i],
+						  common.hnode) {
+				bool last_child;
+
+				if (!q->offload) {
+					htb_destroy_class(sch, cl);
+					continue;
+				}
+
+				nonempty = true;
+
+				if (cl->level)
+					continue;
+
+				changed = true;
+
+				last_child = htb_parent_last_child(cl);
+				htb_destroy_class_offload(sch, cl, last_child,
+							  true);
+				qdisc_class_hash_remove(&q->clhash,
+							&cl->common);
+				if (cl->parent)
+					cl->parent->children--;
+				if (last_child)
+					htb_parent_to_leaf(sch, cl, NULL);
+				htb_destroy_class(sch, cl);
+			}
+		}
+	} while (changed);
+	WARN_ON(nonempty);
+
 	qdisc_class_hash_destroy(&q->clhash);
 	__qdisc_reset_queue(&q->direct_queue);
+
+	if (!q->offload)
+		return;
+
+	offload_opt = (struct tc_htb_qopt_offload) {
+		.command = TC_HTB_DESTROY,
+	};
+	htb_offload(dev, &offload_opt);
+
+	if (!q->direct_qdiscs)
+		return;
+	for (i = 0; i < q->num_direct_qdiscs && q->direct_qdiscs[i]; i++)
+		qdisc_put(q->direct_qdiscs[i]);
+	kfree(q->direct_qdiscs);
 }
 
 static int htb_delete(struct Qdisc *sch, unsigned long arg)
@@ -1260,11 +1583,24 @@  static int htb_delete(struct Qdisc *sch, unsigned long arg)
 	if (cl->children || cl->filter_cnt)
 		return -EBUSY;
 
-	if (!cl->level && htb_parent_last_child(cl)) {
-		new_q = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+	if (!cl->level && htb_parent_last_child(cl))
+		last_child = 1;
+
+	if (q->offload)
+		htb_destroy_class_offload(sch, cl, last_child, false);
+
+	if (last_child) {
+		struct netdev_queue *dev_queue;
+
+		dev_queue = q->offload ? cl->leaf.q->dev_queue : sch->dev_queue;
+		new_q = qdisc_create_dflt(dev_queue, &pfifo_qdisc_ops,
 					  cl->parent->common.classid,
 					  NULL);
-		last_child = 1;
+		if (q->offload) {
+			if (new_q)
+				htb_set_lockdep_class_child(new_q);
+			htb_parent_to_leaf_offload(sch, dev_queue, new_q);
+		}
 	}
 
 	sch_tree_lock(sch);
@@ -1285,7 +1621,7 @@  static int htb_delete(struct Qdisc *sch, unsigned long arg)
 				  &q->hlevel[cl->level].wait_pq);
 
 	if (last_child)
-		htb_parent_to_leaf(q, cl, new_q);
+		htb_parent_to_leaf(sch, cl, new_q);
 
 	sch_tree_unlock(sch);
 
@@ -1300,9 +1636,11 @@  static int htb_change_class(struct Qdisc *sch, u32 classid,
 	int err = -EINVAL;
 	struct htb_sched *q = qdisc_priv(sch);
 	struct htb_class *cl = (struct htb_class *)*arg, *parent;
+	struct tc_htb_qopt_offload offload_opt;
 	struct nlattr *opt = tca[TCA_OPTIONS];
 	struct nlattr *tb[TCA_HTB_MAX + 1];
 	struct Qdisc *parent_qdisc = NULL;
+	struct netdev_queue *dev_queue;
 	struct tc_htb_opt *hopt;
 	u64 rate64, ceil64;
 	int warn = 0;
@@ -1335,8 +1673,12 @@  static int htb_change_class(struct Qdisc *sch, u32 classid,
 		qdisc_put_rtab(qdisc_get_rtab(&hopt->ceil, tb[TCA_HTB_CTAB],
 					      NULL));
 
+	rate64 = tb[TCA_HTB_RATE64] ? nla_get_u64(tb[TCA_HTB_RATE64]) : 0;
+	ceil64 = tb[TCA_HTB_CEIL64] ? nla_get_u64(tb[TCA_HTB_CEIL64]) : 0;
+
 	if (!cl) {		/* new class */
-		struct Qdisc *new_q;
+		struct net_device *dev = qdisc_dev(sch);
+		struct Qdisc *new_q, *old_q;
 		int prio;
 		struct {
 			struct nlattr		nla;
@@ -1379,11 +1721,8 @@  static int htb_change_class(struct Qdisc *sch, u32 classid,
 						NULL,
 						qdisc_root_sleeping_running(sch),
 						tca[TCA_RATE] ? : &est.nla);
-			if (err) {
-				tcf_block_put(cl->block);
-				kfree(cl);
-				goto failure;
-			}
+			if (err)
+				goto err_block_put;
 		}
 
 		cl->children = 0;
@@ -1392,12 +1731,72 @@  static int htb_change_class(struct Qdisc *sch, u32 classid,
 		for (prio = 0; prio < TC_HTB_NUMPRIO; prio++)
 			RB_CLEAR_NODE(&cl->node[prio]);
 
+		cl->common.classid = classid;
+
+		/* Make sure nothing interrupts us in between of two
+		 * ndo_setup_tc calls.
+		 */
+		ASSERT_RTNL();
+
 		/* create leaf qdisc early because it uses kmalloc(GFP_KERNEL)
 		 * so that can't be used inside of sch_tree_lock
 		 * -- thanks to Karlis Peisenieks
 		 */
-		new_q = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+		if (!q->offload) {
+			dev_queue = sch->dev_queue;
+		} else if (!(parent && !parent->level)) {
+			/* Assign a dev_queue to this classid. */
+			offload_opt = (struct tc_htb_qopt_offload) {
+				.command = TC_HTB_LEAF_ALLOC_QUEUE,
+				.classid = cl->common.classid,
+				.parent_classid = parent ?
+					TC_H_MIN(parent->common.classid) :
+					TC_HTB_CLASSID_ROOT,
+				.rate = max_t(u64, hopt->rate.rate, rate64),
+				.ceil = max_t(u64, hopt->ceil.rate, ceil64),
+			};
+			err = htb_offload(dev, &offload_opt);
+			if (err) {
+				pr_err("htb: TC_HTB_LEAF_ALLOC_QUEUE failed with err = %d\n",
+				       err);
+				goto err_kill_estimator;
+			}
+			dev_queue = netdev_get_tx_queue(dev, offload_opt.qid);
+		} else { /* First child. */
+			dev_queue = parent->leaf.q->dev_queue;
+			old_q = htb_graft_helper(dev_queue, NULL);
+			WARN_ON(old_q != parent->leaf.q);
+			offload_opt = (struct tc_htb_qopt_offload) {
+				.command = TC_HTB_LEAF_TO_INNER,
+				.classid = cl->common.classid,
+				.parent_classid =
+					TC_H_MIN(parent->common.classid),
+				.rate = max_t(u64, hopt->rate.rate, rate64),
+				.ceil = max_t(u64, hopt->ceil.rate, ceil64),
+			};
+			err = htb_offload(dev, &offload_opt);
+			if (err) {
+				pr_err("htb: TC_HTB_LEAF_TO_INNER failed with err = %d\n",
+				       err);
+				htb_graft_helper(dev_queue, old_q);
+				goto err_kill_estimator;
+			}
+			qdisc_put(old_q);
+		}
+		new_q = qdisc_create_dflt(dev_queue, &pfifo_qdisc_ops,
 					  classid, NULL);
+		if (q->offload) {
+			if (new_q) {
+				htb_set_lockdep_class_child(new_q);
+				/* One ref for cl->leaf.q, the other for
+				 * dev_queue->qdisc.
+				 */
+				qdisc_refcount_inc(new_q);
+			}
+			old_q = htb_graft_helper(dev_queue, new_q);
+			/* No qdisc_put needed. */
+			WARN_ON(!(old_q->flags & TCQ_F_BUILTIN));
+		}
 		sch_tree_lock(sch);
 		if (parent && !parent->level) {
 			/* turn parent into inner node */
@@ -1415,10 +1814,10 @@  static int htb_change_class(struct Qdisc *sch, u32 classid,
 					 : TC_HTB_MAXDEPTH) - 1;
 			memset(&parent->inner, 0, sizeof(parent->inner));
 		}
+
 		/* leaf (we) needs elementary qdisc */
 		cl->leaf.q = new_q ? new_q : &noop_qdisc;
 
-		cl->common.classid = classid;
 		cl->parent = parent;
 
 		/* set class to be in HTB_CAN_SEND state */
@@ -1444,12 +1843,29 @@  static int htb_change_class(struct Qdisc *sch, u32 classid,
 			if (err)
 				return err;
 		}
-		sch_tree_lock(sch);
-	}
 
-	rate64 = tb[TCA_HTB_RATE64] ? nla_get_u64(tb[TCA_HTB_RATE64]) : 0;
+		if (q->offload) {
+			struct net_device *dev = qdisc_dev(sch);
 
-	ceil64 = tb[TCA_HTB_CEIL64] ? nla_get_u64(tb[TCA_HTB_CEIL64]) : 0;
+			offload_opt = (struct tc_htb_qopt_offload) {
+				.command = TC_HTB_NODE_MODIFY,
+				.classid = cl->common.classid,
+				.rate = max_t(u64, hopt->rate.rate, rate64),
+				.ceil = max_t(u64, hopt->ceil.rate, ceil64),
+			};
+			err = htb_offload(dev, &offload_opt);
+			if (err)
+				/* Estimator was replaced, and rollback may fail
+				 * as well, so we don't try to recover it, and
+				 * the estimator won't work property with the
+				 * offload anyway, because bstats are updated
+				 * only when the stats are queried.
+				 */
+				return err;
+		}
+
+		sch_tree_lock(sch);
+	}
 
 	psched_ratecfg_precompute(&cl->rate, &hopt->rate, rate64);
 	psched_ratecfg_precompute(&cl->ceil, &hopt->ceil, ceil64);
@@ -1492,6 +1908,11 @@  static int htb_change_class(struct Qdisc *sch, u32 classid,
 	*arg = (unsigned long)cl;
 	return 0;
 
+err_kill_estimator:
+	gen_kill_estimator(&cl->rate_est);
+err_block_put:
+	tcf_block_put(cl->block);
+	kfree(cl);
 failure:
 	return err;
 }
@@ -1557,6 +1978,7 @@  static void htb_walk(struct Qdisc *sch, struct qdisc_walker *arg)
 }
 
 static const struct Qdisc_class_ops htb_class_ops = {
+	.select_queue	=	htb_select_queue,
 	.graft		=	htb_graft,
 	.leaf		=	htb_leaf,
 	.qlen_notify	=	htb_qlen_notify,
@@ -1579,6 +2001,7 @@  static struct Qdisc_ops htb_qdisc_ops __read_mostly = {
 	.dequeue	=	htb_dequeue,
 	.peek		=	qdisc_peek_dequeued,
 	.init		=	htb_init,
+	.attach		=	htb_attach,
 	.reset		=	htb_reset,
 	.destroy	=	htb_destroy,
 	.dump		=	htb_dump,
diff --git a/tools/include/uapi/linux/pkt_sched.h b/tools/include/uapi/linux/pkt_sched.h
index 0d18b1d1fbbc..5c903abc9fa5 100644
--- a/tools/include/uapi/linux/pkt_sched.h
+++ b/tools/include/uapi/linux/pkt_sched.h
@@ -414,6 +414,7 @@  enum {
 	TCA_HTB_RATE64,
 	TCA_HTB_CEIL64,
 	TCA_HTB_PAD,
+	TCA_HTB_OFFLOAD,
 	__TCA_HTB_MAX,
 };