Message ID | 79417f27b7c57da5c0eb54bb6d074d3a472d9ebf.1593209494.git.petrm@mellanox.com |
---|---|
State | New |
Series | [net-next,v1,1/5] net: sched: Pass root lock to Qdisc_ops.enqueue |
On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
> The function tcf_qevent_handle() should be invoked when qdisc hits the
> "interesting event" corresponding to a block. This function releases root
> lock for the duration of executing the attached filters, to allow packets
> generated through user actions (notably mirred) to be reinserted to the
> same qdisc tree.

Are you sure releasing the root lock in the middle of an enqueue operation
is a good idea? I mean, it seems racy with qdisc change or reset path,
for example, __red_change() could update some RED parameters
immediately after you release the root lock.

Thanks.
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>> The function tcf_qevent_handle() should be invoked when qdisc hits the
>> "interesting event" corresponding to a block. This function releases root
>> lock for the duration of executing the attached filters, to allow packets
>> generated through user actions (notably mirred) to be reinserted to the
>> same qdisc tree.
>
> Are you sure releasing the root lock in the middle of an enqueue operation
> is a good idea? I mean, it seems racy with qdisc change or reset path,
> for example, __red_change() could update some RED parameters
> immediately after you release the root lock.

So I had mulled over this for a while. If packets are enqueued or
dequeued while the lock is released, maybe the packet under
consideration should have been hard_marked instead of prob_marked, or
vice versa. (I can't really go to not marked at all, because the fact
that we ran the qevent is a very observable commitment to put the packet
in the queue with marking.) I figured that is not such a big deal.

Regarding a configuration change, for a brief period after the change, a
few not-yet-pushed packets could have been enqueued with ECN marking
even as I e.g. disabled ECN. This does not seem like a big deal either,
these are transient effects.
On Tue, Jul 7, 2020 at 8:22 AM Petr Machata <petrm@mellanox.com> wrote:
>
>
> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
> >> The function tcf_qevent_handle() should be invoked when qdisc hits the
> >> "interesting event" corresponding to a block. This function releases root
> >> lock for the duration of executing the attached filters, to allow packets
> >> generated through user actions (notably mirred) to be reinserted to the
> >> same qdisc tree.
> >
> > Are you sure releasing the root lock in the middle of an enqueue operation
> > is a good idea? I mean, it seems racy with qdisc change or reset path,
> > for example, __red_change() could update some RED parameters
> > immediately after you release the root lock.
>
> So I had mulled over this for a while. If packets are enqueued or
> dequeued while the lock is released, maybe the packet under
> consideration should have been hard_marked instead of prob_marked, or
> vice versa. (I can't really go to not marked at all, because the fact
> that we ran the qevent is a very observable commitment to put the packet
> in the queue with marking.) I figured that is not such a big deal.
>
> Regarding a configuration change, for a brief period after the change, a
> few not-yet-pushed packets could have been enqueued with ECN marking
> even as I e.g. disabled ECN. This does not seem like a big deal either,
> these are transient effects.

Hmm, let's see:

1. red_enqueue() caches a pointer to child qdisc, child = q->qdisc
2. root lock is released by tcf_qevent_handle().
3. __red_change() acquires the root lock and then changes child
   qdisc with a new one
4. The old child qdisc is put by qdisc_put()
5. tcf_qevent_handle() acquires the root lock again, and still uses
   the cached but now freed old child qdisc.

Isn't this a problem?

Even if it really is not, why do we make tcf_qevent_handle() callers'
life so hard? They have to be very careful with race conditions like this.

Thanks.
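To make the five steps above concrete, here is a small user-space model of
the hazard (pthreads; purely illustrative, nothing here is kernel code --
"child" stands in for q->qdisc and config_change() for __red_change() plus
qdisc_put()):

/*
 * The enqueue path caches a child pointer, drops the root lock for the
 * "qevent", and a concurrent configuration change swaps the child in
 * the meantime.  All names are invented for the illustration.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct child { int limit; };

static pthread_mutex_t root_lock = PTHREAD_MUTEX_INITIALIZER;
static struct child *current_child;

/* Models __red_change(): grafts a new child under the root lock. */
static void *config_change(void *arg)
{
    struct child *new_child = calloc(1, sizeof(*new_child));
    struct child *old;

    (void)arg;
    pthread_mutex_lock(&root_lock);
    old = current_child;
    current_child = new_child;
    pthread_mutex_unlock(&root_lock);
    /* The real qdisc_put() would destroy 'old' right here; it is kept
     * around only so that the comparison below stays well-defined. */
    (void)old;
    return NULL;
}

/* Models red_enqueue() around a tcf_qevent_handle() call. */
static void enqueue(void)
{
    pthread_t cfg;
    struct child *cached;

    pthread_mutex_lock(&root_lock);
    cached = current_child;              /* step 1: cache the child   */
    pthread_mutex_unlock(&root_lock);    /* step 2: qevent drops lock */

    pthread_create(&cfg, NULL, config_change, NULL);  /* steps 3-4 */
    pthread_join(cfg, NULL);

    pthread_mutex_lock(&root_lock);      /* step 5: lock re-taken     */
    if (cached != current_child)
        printf("cached child is stale; dereferencing it would be a use-after-free\n");
    pthread_mutex_unlock(&root_lock);
}

int main(void)
{
    current_child = calloc(1, sizeof(*current_child));
    enqueue();
    return 0;
}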
On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
> The function tcf_qevent_handle() should be invoked when qdisc hits the
> "interesting event" corresponding to a block. This function releases root
> lock for the duration of executing the attached filters, to allow packets
> generated through user actions (notably mirred) to be reinserted to the
> same qdisc tree.

I read this again, another question here is: why is tcf_qevent_handle()
special here? We call tcf_classify() under root lock in many other places
too, for example htb_enqueue(), which of course includes act_mirred
execution, so why isn't it a problem there?

People added MIRRED_RECURSION_LIMIT for this kinda recursion,
but never released that root lock.

Thanks.
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>> The function tcf_qevent_handle() should be invoked when qdisc hits the
>> "interesting event" corresponding to a block. This function releases root
>> lock for the duration of executing the attached filters, to allow packets
>> generated through user actions (notably mirred) to be reinserted to the
>> same qdisc tree.
>
> I read this again, another question here is: why is tcf_qevent_handle()
> special here? We call tcf_classify() under root lock in many other places
> too, for example htb_enqueue(), which of course includes act_mirred
> execution, so why isn't it a problem there?
>
> People added MIRRED_RECURSION_LIMIT for this kinda recursion,
> but never released that root lock.

Yes, I realized later that the qdiscs that use tcf_classify() for
classification have this exact problem as well. My intention was to fix
it by dropping the lock. Since the classification is the first step of
enqueuing, it should not really lead to races, so hopefully this will be
OK. I don't have any code as of yet.

The recursion limit makes sense for clsact, which is handled out of the
root lock.
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Tue, Jul 7, 2020 at 8:22 AM Petr Machata <petrm@mellanox.com> wrote:
>>
>>
>> Cong Wang <xiyou.wangcong@gmail.com> writes:
>>
>> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>> >> The function tcf_qevent_handle() should be invoked when qdisc hits the
>> >> "interesting event" corresponding to a block. This function releases root
>> >> lock for the duration of executing the attached filters, to allow packets
>> >> generated through user actions (notably mirred) to be reinserted to the
>> >> same qdisc tree.
>> >
>> > Are you sure releasing the root lock in the middle of an enqueue operation
>> > is a good idea? I mean, it seems racy with qdisc change or reset path,
>> > for example, __red_change() could update some RED parameters
>> > immediately after you release the root lock.
>>
>> So I had mulled over this for a while. If packets are enqueued or
>> dequeued while the lock is released, maybe the packet under
>> consideration should have been hard_marked instead of prob_marked, or
>> vice versa. (I can't really go to not marked at all, because the fact
>> that we ran the qevent is a very observable commitment to put the packet
>> in the queue with marking.) I figured that is not such a big deal.
>>
>> Regarding a configuration change, for a brief period after the change, a
>> few not-yet-pushed packets could have been enqueued with ECN marking
>> even as I e.g. disabled ECN. This does not seem like a big deal either,
>> these are transient effects.
>
> Hmm, let's see:
>
> 1. red_enqueue() caches a pointer to child qdisc, child = q->qdisc
> 2. root lock is released by tcf_qevent_handle().
> 3. __red_change() acquires the root lock and then changes child
>    qdisc with a new one
> 4. The old child qdisc is put by qdisc_put()
> 5. tcf_qevent_handle() acquires the root lock again, and still uses
>    the cached but now freed old child qdisc.
>
> Isn't this a problem?

I missed this. It is a problem, destroy gets called right away and then
enqueue would use invalid data.

Also qdisc_graft() calls cops->graft, which locks the tree and swaps the
qdisc pointers, and then qdisc_put()s the original one. So dropping the
root lock can lead to destruction of the qdisc whose enqueue is
currently executed.

So that's no good. The lock has to be held throughout.

> Even if it really is not, why do we make tcf_qevent_handle() callers'
> life so hard? They have to be very careful with race conditions like this.

Do you have another solution in mind here? I think the deadlock (in both
classification and qevents) is an issue, but really don't know how to
avoid it except by dropping the lock.
Petr Machata <petrm@mellanox.com> writes:

> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
>> On Tue, Jul 7, 2020 at 8:22 AM Petr Machata <petrm@mellanox.com> wrote:
>>>
>>>
>>> Cong Wang <xiyou.wangcong@gmail.com> writes:
>>>
>>> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>>> >> The function tcf_qevent_handle() should be invoked when qdisc hits the
>>> >> "interesting event" corresponding to a block. This function releases root
>>> >> lock for the duration of executing the attached filters, to allow packets
>>> >> generated through user actions (notably mirred) to be reinserted to the
>>> >> same qdisc tree.
>>> >
>>> > Are you sure releasing the root lock in the middle of an enqueue operation
>>> > is a good idea? I mean, it seems racy with qdisc change or reset path,
>>> > for example, __red_change() could update some RED parameters
>>> > immediately after you release the root lock.
>>>
>>> So I had mulled over this for a while. If packets are enqueued or
>>> dequeued while the lock is released, maybe the packet under
>>> consideration should have been hard_marked instead of prob_marked, or
>>> vice versa. (I can't really go to not marked at all, because the fact
>>> that we ran the qevent is a very observable commitment to put the packet
>>> in the queue with marking.) I figured that is not such a big deal.
>>>
>>> Regarding a configuration change, for a brief period after the change, a
>>> few not-yet-pushed packets could have been enqueued with ECN marking
>>> even as I e.g. disabled ECN. This does not seem like a big deal either,
>>> these are transient effects.
>>
>> Hmm, let's see:
>>
>> 1. red_enqueue() caches a pointer to child qdisc, child = q->qdisc
>> 2. root lock is released by tcf_qevent_handle().
>> 3. __red_change() acquires the root lock and then changes child
>>    qdisc with a new one
>> 4. The old child qdisc is put by qdisc_put()
>> 5. tcf_qevent_handle() acquires the root lock again, and still uses
>>    the cached but now freed old child qdisc.
>>
>> Isn't this a problem?
>
> I missed this. It is a problem, destroy gets called right away and then
> enqueue would use invalid data.
>
> Also qdisc_graft() calls cops->graft, which locks the tree and swaps the
> qdisc pointers, and then qdisc_put()s the original one. So dropping the
> root lock can lead to destruction of the qdisc whose enqueue is
> currently executed.
>
> So that's no good. The lock has to be held throughout.
>
>> Even if it really is not, why do we make tcf_qevent_handle() callers'
>> life so hard? They have to be very careful with race conditions like this.
>
> Do you have another solution in mind here? I think the deadlock (in both
> classification and qevents) is an issue, but really don't know how to
> avoid it except by dropping the lock.

Actually I guess I could qdisc_refcount_inc() the current qdisc so that
it doesn't go away. Then when enqueuing I could access the child
directly, not relying on the now-obsolete cache from the beginning of
the enqueue function. I suppose that a similar approach could be used in
other users of tcf_classify() as well. What do you think?
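A user-space model of that idea (hypothetical names throughout; obj_get()
and obj_put() stand in for qdisc_refcount_inc() and qdisc_put(), and the
point is to re-read the current pointer after relocking instead of trusting
the cached one):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj {
    atomic_int refcnt;
    int data;
};

static pthread_mutex_t root_lock = PTHREAD_MUTEX_INITIALIZER;
static struct obj *current_obj;

static struct obj *obj_get(struct obj *o)
{
    atomic_fetch_add(&o->refcnt, 1);
    return o;
}

static void obj_put(struct obj *o)
{
    if (atomic_fetch_sub(&o->refcnt, 1) == 1)
        free(o);                        /* last reference drops it */
}

static void enqueue(void)
{
    pthread_mutex_lock(&root_lock);
    struct obj *pinned = obj_get(current_obj);  /* keep it alive */

    pthread_mutex_unlock(&root_lock);
    /* ... qevent filters run here; config may swap current_obj ... */
    pthread_mutex_lock(&root_lock);

    struct obj *now = current_obj;      /* re-read, don't reuse the cache */
    printf("pinned %p stays valid, current child is %p\n",
           (void *)pinned, (void *)now);

    pthread_mutex_unlock(&root_lock);
    obj_put(pinned);
}

int main(void)
{
    current_obj = calloc(1, sizeof(*current_obj));
    atomic_init(&current_obj->refcnt, 1);
    enqueue();
    obj_put(current_obj);
    return 0;
}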
On Wed, Jul 8, 2020 at 5:35 AM Petr Machata <petrm@mellanox.com> wrote:
> Do you have another solution in mind here? I think the deadlock (in both
> classification and qevents) is an issue, but really don't know how to
> avoid it except by dropping the lock.

Ideally we should only take the lock once, but it clearly requires some
work to teach the dev_queue_xmit() in act_mirred not to acquire it again.

Thanks.
On Wed, Jul 8, 2020 at 9:21 AM Petr Machata <petrm@mellanox.com> wrote:
>
> Actually I guess I could qdisc_refcount_inc() the current qdisc so that
> it doesn't go away. Then when enqueuing I could access the child
> directly, not relying on the now-obsolete cache from the beginning of
> the enqueue function. I suppose that a similar approach could be used in
> other users of tcf_classify() as well. What do you think?

The above example is just a quick one I can think of, there could be
more race conditions that lead to other kinds of bugs. I am sure you can
fix that one, but the point is that it is hard to audit and fix them all.

The best solution here is of course not to release that lock, but again
it requires some more work to avoid the deadlock.

Thanks.
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Jul 8, 2020 at 5:35 AM Petr Machata <petrm@mellanox.com> wrote:
>> Do you have another solution in mind here? I think the deadlock (in both
>> classification and qevents) is an issue, but really don't know how to
>> avoid it except by dropping the lock.
>
> Ideally we should only take the lock once, but it clearly requires some
> work to teach the dev_queue_xmit() in act_mirred not to acquire it again.

act_mirred does not acquire it though. The egress path does. And the
packet can traverse several mirred instances on several devices before
it gets to the deadlock. Currently I don't see a way to give the egress
path a way to know that the lock is already taken.

I'll think about it some more. For now I will at least fix the lack of
locking.
Petr Machata <petrm@mellanox.com> writes:

> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
> I'll think about it some more. For now I will at least fix the lack of
> locking.

I guess I could store smp_processor_id() that acquired the lock in
struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
the stored value. I'll need to be careful about the race between
unsuccessful trylock and the test, and about making sure CPU ID doesn't
change after it is read. I'll probe this tomorrow.
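As a rough model of that idea -- not kernel code, just a user-space sketch
where pthread_self() stands in for smp_processor_id() and an explicit owner
field stands in for the one proposed for struct qdisc_skb_head:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Owner-recording lock along the lines of __netif_tx_lock(): on a failed
 * trylock, check whether the current context already owns the lock (a
 * recursive entry via mirred) instead of blocking on itself.  Reading the
 * owner field outside the lock is the racy part mentioned above; it is
 * safe only in the "owner == self" direction, because nobody else can set
 * it to our identity. */
struct root_lock {
    pthread_mutex_t lock;
    pthread_t owner;        /* valid only while owned */
    bool owned;
};

/* Returns true if the lock was taken, false if we already own it. */
static bool root_lock_acquire(struct root_lock *rl)
{
    if (pthread_mutex_trylock(&rl->lock) == 0) {
        rl->owner = pthread_self();
        rl->owned = true;
        return true;
    }
    if (rl->owned && pthread_equal(rl->owner, pthread_self()))
        return false;       /* recursion: the enqueue path re-entered */

    pthread_mutex_lock(&rl->lock);      /* ordinary contention: block */
    rl->owner = pthread_self();
    rl->owned = true;
    return true;
}

static void root_lock_release(struct root_lock *rl)
{
    rl->owned = false;
    pthread_mutex_unlock(&rl->lock);
}

int main(void)
{
    struct root_lock rl = { .lock = PTHREAD_MUTEX_INITIALIZER };

    if (root_lock_acquire(&rl))
        printf("outer: lock taken\n");
    if (!root_lock_acquire(&rl))
        printf("inner: recursion detected, not blocking\n");
    root_lock_release(&rl);
    return 0;
}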
On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
>
>
> Petr Machata <petrm@mellanox.com> writes:
>
> > Cong Wang <xiyou.wangcong@gmail.com> writes:
> >
> > I'll think about it some more. For now I will at least fix the lack of
> > locking.
>
> I guess I could store smp_processor_id() that acquired the lock in
> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
> the stored value. I'll need to be careful about the race between
> unsuccessful trylock and the test, and about making sure CPU ID doesn't
> change after it is read. I'll probe this tomorrow.

Like __netif_tx_lock(), right? Seems doable.

Thanks.
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
>>
>>
>> Petr Machata <petrm@mellanox.com> writes:
>>
>> > Cong Wang <xiyou.wangcong@gmail.com> writes:
>> >
>> > I'll think about it some more. For now I will at least fix the lack of
>> > locking.
>>
>> I guess I could store smp_processor_id() that acquired the lock in
>> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
>> the stored value. I'll need to be careful about the race between
>> unsuccessful trylock and the test, and about making sure CPU ID doesn't
>> change after it is read. I'll probe this tomorrow.
>
> Like __netif_tx_lock(), right? Seems doable.

Good to see it actually used, I wasn't sure if the idea made sense :)

Unfortunately it is not enough.

Consider two threads (A, B) and two netdevices (eth0, eth1):

- "A" takes eth0's root lock and proceeds to classification
- "B" takes eth1's root lock and proceeds to classification
- "A" invokes mirror to eth1, waits on lock held by "B"
- "B" invokes mirror to eth0, waits on lock held by "A"
- Some say they are still waiting to this day.

So one option that I see is to just stash the mirrored packet in a queue
instead of delivering it right away:

- s/netif_receive_skb/netif_rx/ in act_mirred

- Reuse the RX queue for TX packets as well, differentiating the two by
  a bit in SKB CB. Then process_backlog() would call either
  __netif_receive_skb() or dev_queue_xmit().

- Drop mirred_rec_level guard.

This seems to work, but I might be missing something non-obvious, such
as CB actually being used for something already in that context. I would
really rather not introduce a second backlog queue just for mirred
though.

Since mirred_rec_level does not kick in anymore, the same packet can end
up being forwarded from the backlog queue, to the qdisc, and back to the
backlog queue, forever. But that seems OK, that's what the admin
configured, so that's what's happening.

If this is not a good idea for some reason, this might work as well:

- Convert the current root lock to an rw lock. Convert all current
  lockers to write lock (which should be safe), except for enqueue, which
  will take read lock. That will allow many concurrent threads to enter
  enqueue, or one thread several times, but it will exclude all other
  users.

  So this guards configuration access to the qdisc tree, makes sure
  qdiscs don't go away from under one's feet.

- Introduce another spin lock to guard the private data of the qdisc
  tree, counters etc., things that even two concurrent enqueue
  operations shouldn't trample on. Enqueue takes this spin lock after
  read-locking the root lock. act_mirred drops it before injecting the
  packet and takes it again afterwards.

Any opinions y'all?
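For what it's worth, a user-space model of the second option (a pthread
rwlock guarding the tree structure plus an inner lock guarding the
qdisc-private data; all names invented, nothing here is kernel code):

#include <pthread.h>
#include <stdio.h>

/* Config paths take the tree lock for writing; enqueue takes it for
 * reading, so it can run concurrently and re-enter itself.  The data
 * lock protects queues and counters and is what the mirred point drops
 * around packet injection. */
struct qdisc_model {
    pthread_rwlock_t tree_lock;     /* structure: children, filters   */
    pthread_mutex_t data_lock;      /* private data: queue, counters  */
    unsigned long qlen;
};

static void config_change(struct qdisc_model *q)
{
    pthread_rwlock_wrlock(&q->tree_lock);   /* excludes all enqueues */
    /* ... swap child qdisc, update parameters ... */
    pthread_rwlock_unlock(&q->tree_lock);
}

static void enqueue(struct qdisc_model *q, int depth)
{
    pthread_rwlock_rdlock(&q->tree_lock);   /* tree cannot change    */
    pthread_mutex_lock(&q->data_lock);
    q->qlen++;                              /* touch private data    */

    if (depth) {
        /* "qevent"/mirred point: drop only the data lock, keep the
         * read lock, and re-enter the qdisc tree. */
        pthread_mutex_unlock(&q->data_lock);
        enqueue(q, depth - 1);
        pthread_mutex_lock(&q->data_lock);
    }

    pthread_mutex_unlock(&q->data_lock);
    pthread_rwlock_unlock(&q->tree_lock);
}

int main(void)
{
    struct qdisc_model q = {
        .tree_lock = PTHREAD_RWLOCK_INITIALIZER,
        .data_lock = PTHREAD_MUTEX_INITIALIZER,
    };

    enqueue(&q, 2);     /* re-entrant enqueue does not self-deadlock */
    config_change(&q);
    printf("qlen = %lu\n", q.qlen);
    return 0;
}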
On Fri, Jul 10, 2020 at 7:40 AM Petr Machata <petrm@mellanox.com> wrote:
>
>
> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
> > On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
> >>
> >>
> >> Petr Machata <petrm@mellanox.com> writes:
> >>
> >> > Cong Wang <xiyou.wangcong@gmail.com> writes:
> >> >
> >> > I'll think about it some more. For now I will at least fix the lack of
> >> > locking.
> >>
> >> I guess I could store smp_processor_id() that acquired the lock in
> >> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
> >> the stored value. I'll need to be careful about the race between
> >> unsuccessful trylock and the test, and about making sure CPU ID doesn't
> >> change after it is read. I'll probe this tomorrow.
> >
> > Like __netif_tx_lock(), right? Seems doable.
>
> Good to see it actually used, I wasn't sure if the idea made sense :)
>
> Unfortunately it is not enough.
>
> Consider two threads (A, B) and two netdevices (eth0, eth1):
>
> - "A" takes eth0's root lock and proceeds to classification
> - "B" takes eth1's root lock and proceeds to classification
> - "A" invokes mirror to eth1, waits on lock held by "B"
> - "B" invokes mirror to eth0, waits on lock held by "A"
> - Some say they are still waiting to this day.

Sure, AA or ABBA deadlock.

>
> So one option that I see is to just stash the mirrored packet in a queue
> instead of delivering it right away:
>
> - s/netif_receive_skb/netif_rx/ in act_mirred
>
> - Reuse the RX queue for TX packets as well, differentiating the two by
>   a bit in SKB CB. Then process_backlog() would call either
>   __netif_receive_skb() or dev_queue_xmit().
>
> - Drop mirred_rec_level guard.

I don't think I follow you, the root qdisc lock is on egress which has
nothing to do with ingress, so I don't see how netif_rx() is even involved.

>
> This seems to work, but I might be missing something non-obvious, such
> as CB actually being used for something already in that context. I would
> really rather not introduce a second backlog queue just for mirred
> though.
>
> Since mirred_rec_level does not kick in anymore, the same packet can end
> up being forwarded from the backlog queue, to the qdisc, and back to the
> backlog queue, forever. But that seems OK, that's what the admin
> configured, so that's what's happening.
>
> If this is not a good idea for some reason, this might work as well:
>
> - Convert the current root lock to an rw lock. Convert all current
>   lockers to write lock (which should be safe), except for enqueue, which
>   will take read lock. That will allow many concurrent threads to enter
>   enqueue, or one thread several times, but it will exclude all other
>   users.

Are you sure we can parallelize enqueue()? They all need to move
skb into some queue, which is not able to parallelize with just a read
lock. Even the "lockless" qdisc takes a spinlock, r->producer_lock,
for enqueue().

>
>   So this guards configuration access to the qdisc tree, makes sure
>   qdiscs don't go away from under one's feet.
>
> - Introduce another spin lock to guard the private data of the qdisc
>   tree, counters etc., things that even two concurrent enqueue
>   operations shouldn't trample on. Enqueue takes this spin lock after
>   read-locking the root lock. act_mirred drops it before injecting the
>   packet and takes it again afterwards.
>
> Any opinions y'all?

I thought about forbidding mirror/redirecting to the same device,
but there might be some legitimate use cases of such. So, I don't
have any other ideas yet, perhaps there is some way to refactor
dev_queue_xmit() to avoid this deadlock.

Thanks.
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Fri, Jul 10, 2020 at 7:40 AM Petr Machata <petrm@mellanox.com> wrote:
>>
>>
>> Cong Wang <xiyou.wangcong@gmail.com> writes:
>>
>> > On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
>> >>
>> >>
>> >> Petr Machata <petrm@mellanox.com> writes:
>> >>
>> >> > Cong Wang <xiyou.wangcong@gmail.com> writes:
>> >> >
>> >> > I'll think about it some more. For now I will at least fix the lack of
>> >> > locking.
>> >>
>> >> I guess I could store smp_processor_id() that acquired the lock in
>> >> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
>> >> the stored value. I'll need to be careful about the race between
>> >> unsuccessful trylock and the test, and about making sure CPU ID doesn't
>> >> change after it is read. I'll probe this tomorrow.
>> >
>> > Like __netif_tx_lock(), right? Seems doable.
>>
>> Good to see it actually used, I wasn't sure if the idea made sense :)
>>
>> Unfortunately it is not enough.
>>
>> Consider two threads (A, B) and two netdevices (eth0, eth1):
>>
>> - "A" takes eth0's root lock and proceeds to classification
>> - "B" takes eth1's root lock and proceeds to classification
>> - "A" invokes mirror to eth1, waits on lock held by "B"
>> - "B" invokes mirror to eth0, waits on lock held by "A"
>> - Some say they are still waiting to this day.
>
> Sure, AA or ABBA deadlock.
>
>>
>> So one option that I see is to just stash the mirrored packet in a queue
>> instead of delivering it right away:
>>
>> - s/netif_receive_skb/netif_rx/ in act_mirred
>>
>> - Reuse the RX queue for TX packets as well, differentiating the two by
>>   a bit in SKB CB. Then process_backlog() would call either
>>   __netif_receive_skb() or dev_queue_xmit().
>>
>> - Drop mirred_rec_level guard.
>
> I don't think I follow you, the root qdisc lock is on egress which has
> nothing to do with ingress, so I don't see how netif_rx() is even involved.

netif_rx() isn't, but __netif_receive_skb() is, and that can lead to the
deadlock as well when another mirred redirects it back to the locked
egress queue.

So a way to solve "mirred ingress dev" action deadlock is to
s/netif_receive_skb/netif_rx/. I.e. don't resolve the mirror right away,
go through the per-CPU queue.

Then "mirred egress dev" could be fixed similarly by repurposing the
queue for both ingress and egress, differentiating ingress packets from
egress ones by a bit in SKB CB.

>>
>> This seems to work, but I might be missing something non-obvious, such
>> as CB actually being used for something already in that context. I would
>> really rather not introduce a second backlog queue just for mirred
>> though.
>>
>> Since mirred_rec_level does not kick in anymore, the same packet can end
>> up being forwarded from the backlog queue, to the qdisc, and back to the
>> backlog queue, forever. But that seems OK, that's what the admin
>> configured, so that's what's happening.
>>
>> If this is not a good idea for some reason, this might work as well:
>>
>> - Convert the current root lock to an rw lock. Convert all current
>>   lockers to write lock (which should be safe), except for enqueue, which
>>   will take read lock. That will allow many concurrent threads to enter
>>   enqueue, or one thread several times, but it will exclude all other
>>   users.
>
> Are you sure we can parallelize enqueue()? They all need to move
> skb into some queue, which is not able to parallelize with just a read
> lock. Even the "lockless" qdisc takes a spinlock, r->producer_lock,
> for enqueue().

That's what the second spin lock is for. It guards private data,
including the queues.

>>
>>   So this guards configuration access to the qdisc tree, makes sure
>>   qdiscs don't go away from under one's feet.
>>
>> - Introduce another spin lock to guard the private data of the qdisc
>>   tree, counters etc., things that even two concurrent enqueue
>>   operations shouldn't trample on. Enqueue takes this spin lock after
>>   read-locking the root lock. act_mirred drops it before injecting the
>>   packet and takes it again afterwards.
>>
>> Any opinions y'all?
>
> I thought about forbidding mirror/redirecting to the same device,
> but there might be some legitimate use cases of such. So, I don't

Yes, and also that's not enough:

- A chain of mirreds can achieve the deadlocks as well (i.e. mirror to
  eth1, redirect back to eth0). Or the ABBA case shown above, where it's
  two actions that don't even work with the same packets causing the
  deadlock.

- I suspect general forwarding could cause this deadlock as well. E.g.
  redirecting to ingress of a device, where bridge, router take over and
  bring the packet back to egress. I have not tried reproducing this
  though, maybe there's a queue or delayed work etc. somewhere in there
  that makes this not an issue.

> have any other ideas yet, perhaps there is some way to refactor
> dev_queue_xmit() to avoid this deadlock.
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index ed65619cbc47..b0e11bbb6d7a 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -32,6 +32,12 @@ struct tcf_block_ext_info {
 	u32 block_index;
 };
 
+struct tcf_qevent {
+	struct tcf_block	*block;
+	struct tcf_block_ext_info info;
+	struct tcf_proto __rcu *filter_chain;
+};
+
 struct tcf_block_cb;
 bool tcf_queue_work(struct rcu_work *rwork, work_func_t func);
 
@@ -552,6 +558,49 @@ int tc_setup_cb_reoffload(struct tcf_block *block, struct tcf_proto *tp,
 			  void *cb_priv, u32 *flags, unsigned int *in_hw_count);
 unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
 
+#ifdef CONFIG_NET_CLS_ACT
+int tcf_qevent_init(struct tcf_qevent *qe, struct Qdisc *sch,
+		    enum flow_block_binder_type binder_type,
+		    struct nlattr *block_index_attr,
+		    struct netlink_ext_ack *extack);
+void tcf_qevent_destroy(struct tcf_qevent *qe, struct Qdisc *sch);
+int tcf_qevent_validate_change(struct tcf_qevent *qe, struct nlattr *block_index_attr,
+			       struct netlink_ext_ack *extack);
+struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
+				  spinlock_t *root_lock, struct sk_buff **to_free, int *ret);
+int tcf_qevent_dump(struct sk_buff *skb, int attr_name, struct tcf_qevent *qe);
+#else
+static inline int tcf_qevent_init(struct tcf_qevent *qe, struct Qdisc *sch,
+				  enum flow_block_binder_type binder_type,
+				  struct nlattr *block_index_attr,
+				  struct netlink_ext_ack *extack)
+{
+	return 0;
+}
+
+static inline void tcf_qevent_destroy(struct tcf_qevent *qe, struct Qdisc *sch)
+{
+}
+
+static inline int tcf_qevent_validate_change(struct tcf_qevent *qe, struct nlattr *block_index_attr,
+					     struct netlink_ext_ack *extack)
+{
+	return 0;
+}
+
+static inline struct sk_buff *
+tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
+		  spinlock_t *root_lock, struct sk_buff **to_free, int *ret)
+{
+	return skb;
+}
+
+static inline int tcf_qevent_dump(struct sk_buff *skb, int attr_name, struct tcf_qevent *qe)
+{
+	return 0;
+}
+#endif
+
 struct tc_cls_u32_knode {
 	struct tcf_exts *exts;
 	struct tcf_result *res;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index a00a203b2ef5..c594a6343be1 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3741,6 +3741,125 @@ unsigned int tcf_exts_num_actions(struct tcf_exts *exts)
 }
 EXPORT_SYMBOL(tcf_exts_num_actions);
 
+#ifdef CONFIG_NET_CLS_ACT
+static int tcf_qevent_parse_block_index(struct nlattr *block_index_attr,
+					u32 *p_block_index,
+					struct netlink_ext_ack *extack)
+{
+	*p_block_index = nla_get_u32(block_index_attr);
+	if (!*p_block_index) {
+		NL_SET_ERR_MSG(extack, "Block number may not be zero");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int tcf_qevent_init(struct tcf_qevent *qe, struct Qdisc *sch,
+		    enum flow_block_binder_type binder_type,
+		    struct nlattr *block_index_attr,
+		    struct netlink_ext_ack *extack)
+{
+	u32 block_index;
+	int err;
+
+	if (!block_index_attr)
+		return 0;
+
+	err = tcf_qevent_parse_block_index(block_index_attr, &block_index, extack);
+	if (err)
+		return err;
+
+	if (!block_index)
+		return 0;
+
+	qe->info.binder_type = binder_type;
+	qe->info.chain_head_change = tcf_chain_head_change_dflt;
+	qe->info.chain_head_change_priv = &qe->filter_chain;
+	qe->info.block_index = block_index;
+
+	return tcf_block_get_ext(&qe->block, sch, &qe->info, extack);
+}
+EXPORT_SYMBOL(tcf_qevent_init);
+
+void tcf_qevent_destroy(struct tcf_qevent *qe, struct Qdisc *sch)
+{
+	if (qe->info.block_index)
+		tcf_block_put_ext(qe->block, sch, &qe->info);
+}
+EXPORT_SYMBOL(tcf_qevent_destroy);
+
+int tcf_qevent_validate_change(struct tcf_qevent *qe, struct nlattr *block_index_attr,
+			       struct netlink_ext_ack *extack)
+{
+	u32 block_index;
+	int err;
+
+	if (!block_index_attr)
+		return 0;
+
+	err = tcf_qevent_parse_block_index(block_index_attr, &block_index, extack);
+	if (err)
+		return err;
+
+	/* Bounce newly-configured block or change in block. */
+	if (block_index != qe->info.block_index) {
+		NL_SET_ERR_MSG(extack, "Change of blocks is not supported");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(tcf_qevent_validate_change);
+
+struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
+				  spinlock_t *root_lock, struct sk_buff **to_free, int *ret)
+{
+	struct tcf_result cl_res;
+	struct tcf_proto *fl;
+
+	if (!qe->info.block_index)
+		return skb;
+
+	fl = rcu_dereference_bh(qe->filter_chain);
+
+	if (root_lock)
+		spin_unlock(root_lock);
+
+	switch (tcf_classify(skb, fl, &cl_res, false)) {
+	case TC_ACT_SHOT:
+		qdisc_qstats_drop(sch);
+		__qdisc_drop(skb, to_free);
+		*ret = __NET_XMIT_BYPASS;
+		return NULL;
+	case TC_ACT_STOLEN:
+	case TC_ACT_QUEUED:
+	case TC_ACT_TRAP:
+		__qdisc_drop(skb, to_free);
+		*ret = __NET_XMIT_STOLEN;
+		return NULL;
+	case TC_ACT_REDIRECT:
+		skb_do_redirect(skb);
+		*ret = __NET_XMIT_STOLEN;
+		return NULL;
+	}
+
+	if (root_lock)
+		spin_lock(root_lock);
+
+	return skb;
+}
+EXPORT_SYMBOL(tcf_qevent_handle);
+
+int tcf_qevent_dump(struct sk_buff *skb, int attr_name, struct tcf_qevent *qe)
+{
+	if (!qe->info.block_index)
+		return 0;
+	return nla_put_u32(skb, attr_name, qe->info.block_index);
+}
+EXPORT_SYMBOL(tcf_qevent_dump);
+#endif
+
 static __net_init int tcf_net_init(struct net *net)
 {
 	struct tcf_net *tn = net_generic(net, tcf_net_id);
Qevents are attach points for TC blocks, where filters can be put that are
executed when "interesting events" take place in a qdisc. The data to keep
and the functions to invoke to maintain a qevent will be largely the same
between qevents. Therefore introduce sched-wide helpers for qevent
management.

Currently, similarly to ingress and egress blocks of clsact pseudo-qdisc,
blocks attachment cannot be changed after the qdisc is created. To that
end, add a helper tcf_qevent_validate_change(), which verifies whether
block index attribute is not attached, or if it is, whether its value
matches the current one (i.e. there is no material change).

The function tcf_qevent_handle() should be invoked when qdisc hits the
"interesting event" corresponding to a block. This function releases root
lock for the duration of executing the attached filters, to allow packets
generated through user actions (notably mirred) to be reinserted to the
same qdisc tree.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/net/pkt_cls.h |  49 +++++++++++++++++
 net/sched/cls_api.c   | 119 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 168 insertions(+)
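The calling pattern these helpers imply, sketched for a hypothetical qdisc
(all foo_*, TCA_FOO_* and FLOW_BLOCK_BINDER_TYPE_FOO_* names are invented;
only the tcf_qevent_* calls and their signatures come from this patch, the
enqueue prototype assumes the root_lock argument added in patch 1/5 of this
series, and the real consumer, sch_red, comes in a later patch -- this is
not a compilable qdisc):

/* One qevent ("mark") wired through the new helpers. */
struct foo_sched_data {
	struct tcf_qevent qe_mark;
};

static int foo_init(struct Qdisc *sch, struct nlattr **tb,
		    struct netlink_ext_ack *extack)
{
	struct foo_sched_data *q = qdisc_priv(sch);

	/* nlattr parsing elided; bind the block named by the attribute. */
	return tcf_qevent_init(&q->qe_mark, sch,
			       FLOW_BLOCK_BINDER_TYPE_FOO_MARK,
			       tb[TCA_FOO_MARK_BLOCK], extack);
}

static int foo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
		       spinlock_t *root_lock, struct sk_buff **to_free)
{
	struct foo_sched_data *q = qdisc_priv(sch);
	int ret;

	if (foo_wants_to_mark(skb)) {	/* qdisc-specific condition */
		skb = tcf_qevent_handle(&q->qe_mark, sch, skb, root_lock,
					to_free, &ret);
		if (!skb)
			return ret;	/* filters dropped/stole the packet */
	}
	/* ... enqueue skb to the child qdisc as before ... */
	return NET_XMIT_SUCCESS;
}

static int foo_change(struct Qdisc *sch, struct nlattr **tb,
		      struct netlink_ext_ack *extack)
{
	struct foo_sched_data *q = qdisc_priv(sch);

	/* The block cannot be swapped after creation, only re-validated. */
	return tcf_qevent_validate_change(&q->qe_mark,
					  tb[TCA_FOO_MARK_BLOCK], extack);
}

static int foo_dump(struct Qdisc *sch, struct sk_buff *skb)
{
	struct foo_sched_data *q = qdisc_priv(sch);

	return tcf_qevent_dump(skb, TCA_FOO_MARK_BLOCK, &q->qe_mark);
}

static void foo_destroy(struct Qdisc *sch)
{
	struct foo_sched_data *q = qdisc_priv(sch);

	tcf_qevent_destroy(&q->qe_mark, sch);
}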