Message ID | 20180403215526.15934-1-jrosen@cisco.com |
---|---|
State | New |
Headers | show |
Series | [RFC] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun | expand |
On Tue, 3 Apr 2018 17:55:25 -0400 Jon Rosen <jrosen@cisco.com> wrote: > Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which > casues the ring to get corrupted by allowing multiple kernel threads > to claim ownership of the same ring entry, Mark the ring entry as > already being used within the spin_lock to prevent other kernel > threads from reusing the same entry before it's fully filled in, > passed to user space, and then eventually passed back to the kernel > for use with a new packet. > > Note that the proposed change may modify the semantics of the > interface between kernel space and user space in a way which may cause > some applications to no longer work properly. More discussion on this > change can be found in the additional comments section titled > "3. Discussion on packet_mmap ownership semantics:". > > Signed-off-by: Jon Rosen <jrosen@cisco.com> > --- > > Additional Comments Section > --------------------------- > > 1. Description of the diffs: > ---------------------------- > > TPACKET_V1 and TPACKET_V2 format rings: > ------------------------------------------- > Mark each entry as TP_STATUS_IN_PROGRESS after allocating to > prevent other kernel threads from re-using the same entry. > > This is necessary because there may be a delay from the time the > spin_lock is released to the time that the packet is completed and > the corresponding ring entry is marked as owned by user space. If > during this time other kernel threads enqueue more packets to the > ring than the size of the ring then it will cause mutliple kernel > threads to operate on the same entry at the same time, corrupting > packets and the ring state. > > By marking the entry as allocated (IN_PROGRESS) we prevent other > kernel threads from incorrectly re-using an entry that is still in > the progress of being filled in before it is passed to user space. > > This forces each entry through the following states: > > +-> 1. (tp_status == TP_STATUS_KERNEL) > | Free: For use by any kernel thread to store a new packet > | > | 2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER) > | Allocated: In use by a *specific* kernel thread > | > | 3. (tp_status & TP_STATUS_USER) > | Available: Packet available for user space to process > | > +-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL > > > No impact on TPACKET_V3 format rings: > ------------------------------------- > Packet entry ownership is already protected from other kernel > threads potentially re-using the same entry. This is done inside > packet_current_rx_frame() where storage is allocated for the > current packet. Since this is done within the spin_lock no > additional status updates for this entry are required. > > > Defining TP_STATUS_IN_PROGRESS: > ------------------------------- > Rather than defining a new-bit we re-use an existing bit for this > intermediate state. Status will eventually be overwritten with the > actual true status when passed to user space. Any bit used to pass > information to user space other than the one that passes ownership > is suitable (can't use TP_STATUS_USER). Alternatively a new bit > could be defined. > > > 2. More detailed discussion: > ---------------------------- > Ring entries basically have 2 states, owned by the kernel or owned by > user space. For single producer/single consumer this works fine. For > multiple producers there is a window between the call to spin_unlock > [F] and the call to __packet_set_status [J] where if there are enough > packets added to the ring by other kernel threads then the ring can > wrap and multiple threads will end up using the same ring entries. > > This occurs because the ring entry alocated at [C] did not modify the > state of the entry so it continues to appear as owned by the kernel > and available for use for new packets even though it has already been > allocated. > > A simple fix is to temporarily mark the ring entries within the spin > lock such that user space will still think it?s owned by the kernel > and other kernel threads will not see it as available to be used for > new packets. If a kernel thread gets delayed between [F] and [J] for > an extended period of time and the ring wraps back to the same point > then subsiquent kernel threads attempts to allocate will fail and be > treated as the ring being full. > > The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit > to prevent other kernel threads from re-using the same entry. Note that > any existing bit other than TP_STATUS_USER could have been used. > > af_packet.c:tpacket_rcv() > ... code removed for brevity ... > > // Acquire spin lock > A: spin_lock(&sk->sk_receive_queue.lock); > > // Preemption is disabled > > // Get current ring entry > B: h.raw = packet_current_rx_frame( > po, skb, TP_STATUS_KERNEL, (macoff+snaplen)); > > // Get out if ring is full > // Code not show but it will also release lock > if (!h.raw) goto ring_is_full; > > // Allocate ring entry > C: packet_increment_rx_head(po, &po->rx_ring); > > // Protect against allocation overrun (the simple fix) > // by changing TP_STATUS_KERNEL to TP_STATUS_IN_PROGRESS > D: if (po->tp_version <= TPACKET_V2) > __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS); > > // Update counter > E: po->stats.stats1.tp_packets++; > > // Release spin lock > F: spin_unlock(&sk->sk_receive_queue.lock); > > // Copy packet payload > G: skb_copy_bits(skb, 0, h.raw + macoff, snaplen); > > // Get current timestamp > H: if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp))) > getnstimeofday(&ts); > status |= ts_status; > > // Fill in header portions of ring entry (removed a bunch of > // writes for brevity) > ... > h.h1->tp_sec = ts.tv_sec; > h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC; > sll->sll_halen = dev_parse_header(skb, sll->sll_addr); > ... > > // Memory Barrier to make sure all data is written before > // passing ownership to user space > I: smp_mb(); > > // Update Status, passing ownership to user space > J: __packet_set_status(po, h.raw, status | TP_STATUS_USER); > > > 3. Discussion on packet_mmap ownership semantics: > ------------------------------------------------- > One issue with the above proposed change to use TP_STATUS_IN_PROGRESS > is that the documentation of the tp_status field is somewhat > inconsistent. In some places it's described as TP_STATUS_KERNEL(0) > meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0) > meaning the entry is owned by user space. In other places ownership > by user space is defined by the TP_STATUS_USER(1) bit being set. > > Relevant section of packet_mmap documentation from > https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt > follow but a summary of the semantics in question are described as: > > - At the beginning of each frame there is an status field > - If this field is 0 means that the frame is ready to be > used for the kernel > - If not, there is a frame the user can read > > This would indicate that 0 means owned by the kernel and non-0 is > owned by the user. The simple fix above is to set the status to > something that neither the kernel nor the user thinks they own. That > means that 0 vs. non-0 would be insufficient. > > The description and examples in packet_mmap.txt actually talk about > using a specific bit, the TP_STATUS_USER bit, to indicate something is > owned by the kernel vs something is owned by the user. The relevant > references from packet_mmap.txt to this bit are: > > - The kernel initializes all frames to TP_STATUS_KERNEL(0) > - when the kernel receives a packet it puts in the buffer and > updates the status with at least the TP_STATUS_USER(1) flag > > This description contradicts the previous description which implied > that non-0 means owned by the user. In the above statement it implies > that there needs to be more than just non-0 and that the > TP_STATUS_USER bit must be set for it to be owned by user space. > > If the above proposed fix of using a new TP_STATUS_IN_PROGRESS bit (or > any existing bit other than TP_STATUS_USER) were to be used and an > application were written assuming !TP_STATUS_KERNEL implies owned by > user space then the application would incorrectly assume there was a > valid packet entry to be processed when the entry is not ready. If an > application were instead written to look specificaly for > TP_STATUS_USER then the above proposed fix would work correctly. > > A more complex solution might be to create a shadow data structure > which is private to the kernel and keeps status bits for each ring > entry to indicate the intermediate state between owned by kernel and > owned by user space. > > > 4. Snipet from packet_mmap.txt > ------------------------------ > At the beginning of each frame there is an status field (see > struct tpacket_hdr). If this field is 0 means that the frame is ready > to be used for the kernel, If not, there is a frame the user can read > and the following flags apply: > > +++ Capture process: > from include/linux/if_packet.h > > #define TP_STATUS_COPY (1 << 1) > #define TP_STATUS_LOSING (1 << 2) > #define TP_STATUS_CSUMNOTREADY (1 << 3) > #define TP_STATUS_CSUM_VALID (1 << 7) > > TP_STATUS_COPY : This flag indicates that the frame (and associated > meta information) has been truncated because it's > larger than tp_frame_size. This packet can be > read entirely with recvfrom(). > > In order to make this work it must to be > enabled previously with setsockopt() and > the PACKET_COPY_THRESH option. > > The number of frames that can be buffered to > be read with recvfrom is limited like a normal socket. > See the SO_RCVBUF option in the socket (7) man page. > > TP_STATUS_LOSING : indicates there were packet drops from last time > statistics where checked with getsockopt() and > the PACKET_STATISTICS option. > > TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which > its checksum will be done in hardware. So while > reading the packet we should not try to check the > checksum. > > TP_STATUS_CSUM_VALID : This flag indicates that at least the transport > header checksum of the packet has been already > validated on the kernel side. If the flag is not set > then we are free to check the checksum by ourselves > provided that TP_STATUS_CSUMNOTREADY is also not set. > > for convenience there are also the following defines: > > #define TP_STATUS_KERNEL 0 > #define TP_STATUS_USER 1 > > The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel > receives a packet it puts in the buffer and updates the status with > at least the TP_STATUS_USER flag. Then the user can read the packet, > once the packet is read the user must zero the status field, so the kernel > can use again that frame buffer. > > The user can use poll (any other variant should apply too) to check if new > packets are in the ring: > > struct pollfd pfd; > > pfd.fd = fd; > pfd.revents = 0; > pfd.events = POLLIN|POLLRDNORM|POLLERR; > > if (status == TP_STATUS_KERNEL) > retval = poll(&pfd, 1, timeout); > > It doesn't incur in a race condition to first check the status value and > then poll for frames. > > > 5. Snipet from man pages for packet(7): > --------------------------------------- > > See portion marked with "***" for reference to use of TP_STATUS_USER bit. > > PACKET_RX_RING > > Create a memory-mapped ring buffer for asynchronous > packet reception. The packet socket reserves a > contiguous region of application address space, lays it > out into an array of packet slots and copies packets (up > to tp_snaplen) into subsequent slots. Each packet is > preceded by a metadata structure similar to > tpacket_auxdata. The protocol fields encode the offset > to the data from the start of the metadata header. > tp_net stores the offset to the network layer. If the > packet socket is of type SOCK_DGRAM, then tp_mac is the > same. If it is of type SOCK_RAW, then that field stores > the offset to the link-layer frame. Packet socket and > application communi? cate the head and tail of the ring > through the tp_status field. The packet socket owns all > slots with tp_status equal to TP_STATUS_KERNEL. After > filling a slot, it changes the status of the slot to > *** transfer ownership to the application. During normal > *** operation, the new tp_status value has at least the > *** TP_STATUS_USER bit set to signal that a received packet > *** has been stored. When the application has finished > processing a packet, it transfers ownership of the slot > back to the socket by setting tp_status equal to > TP_STATUS_KERNEL. > > Packet sockets implement multiple variants of the packet > ring. The implementation details are described in > Documentation/networking/packet_mmap.txt in the Linux > kernel source tree. > > > 6. Relevant files: > ------------------ > net/packet/af_packet.c > include/uapi/linux/if_packet.h > Documentation/networking/packet_mmap.txt > > Jon Rosen (1): > packet: mark ring entry as in-use inside spin_lock to prevent RX ring > overrun > > net/packet/af_packet.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index e0f3f4a..264d7b2 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, > if (po->stats.stats1.tp_drops) > status |= TP_STATUS_LOSING; > } > + > + /* > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other > + * kernel threads from re-using this same entry. > + */ > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING > + if (po->tp_version <= TPACKET_V2) > + __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS); > + > po->stats.stats1.tp_packets++; > if (copy_skb) { > status |= TP_STATUS_COPY; This patch looks correct. Please resend it with proper signed-off-by and with a kernel code indenting style (tabs). Is this bug present since the beginning of af_packet and multiqueue devices or did it get introduced in some previous kernel?
On Tue, Apr 3, 2018 at 11:55 PM, Jon Rosen <jrosen@cisco.com> wrote: > Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which > casues the ring to get corrupted by allowing multiple kernel threads > to claim ownership of the same ring entry, Mark the ring entry as > already being used within the spin_lock to prevent other kernel > threads from reusing the same entry before it's fully filled in, > passed to user space, and then eventually passed back to the kernel > for use with a new packet. > > Note that the proposed change may modify the semantics of the > interface between kernel space and user space in a way which may cause > some applications to no longer work properly. As long as TP_STATUS_USER (1) is not set, userspace should ignore the slot.. > One issue with the above proposed change to use TP_STATUS_IN_PROGRESS > is that the documentation of the tp_status field is somewhat > inconsistent. In some places it's described as TP_STATUS_KERNEL(0) > meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0) > meaning the entry is owned by user space. In other places ownership > by user space is defined by the TP_STATUS_USER(1) bit being set. But indeed this example in packet_mmap.txt is problematic if (status == TP_STATUS_KERNEL) retval = poll(&pfd, 1, timeout); It does not really matter whether the docs are possibly inconsistent and which one is authoritative. Examples like the above make it likely that some user code expects such code to work. > +++ b/net/packet/af_packet.c > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, > if (po->stats.stats1.tp_drops) > status |= TP_STATUS_LOSING; > } > + > + /* > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other > + * kernel threads from re-using this same entry. > + */ > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING No need to reinterpret existing flags. tp_status is a u32 with sufficient undefined bits. > + if (po->tp_version <= TPACKET_V2) > + __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS); > + > po->stats.stats1.tp_packets++; > if (copy_skb) { > status |= TP_STATUS_COPY; > -- > 2.10.3.dirty >
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > > index e0f3f4a..264d7b2 100644 > > --- a/net/packet/af_packet.c > > +++ b/net/packet/af_packet.c > > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, > > if (po->stats.stats1.tp_drops) > > status |= TP_STATUS_LOSING; > > } > > + > > + /* > > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other > > + * kernel threads from re-using this same entry. > > + */ > > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING > > + if (po->tp_version <= TPACKET_V2) > > + __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS); > > + > > po->stats.stats1.tp_packets++; > > if (copy_skb) { > > status |= TP_STATUS_COPY; > > This patch looks correct. Please resend it with proper signed-off-by > and with a kernel code indenting style (tabs). Is this bug present > since the beginning of af_packet and multiqueue devices or did it get > introduced in some previous kernel? Sorry about the tabs, I'll fix that and try to figure out what I did wrong with the signed-off-by. I've looked back as far as I could find online (2.6.11) and it would appear that this bug has always been there. Thanks, jon.
On Wednesday, April 04, 2018 9:49 AM, Willem de Bruijn <willemb@google.com> wrote: > > On Tue, Apr 3, 2018 at 11:55 PM, Jon Rosen <jrosen@cisco.com> wrote: > > Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which > > casues the ring to get corrupted by allowing multiple kernel threads > > to claim ownership of the same ring entry, Mark the ring entry as > > already being used within the spin_lock to prevent other kernel > > threads from reusing the same entry before it's fully filled in, > > passed to user space, and then eventually passed back to the kernel > > for use with a new packet. > > > > Note that the proposed change may modify the semantics of the > > interface between kernel space and user space in a way which may cause > > some applications to no longer work properly. > > As long as TP_STATUS_USER (1) is not set, userspace should ignore > the slot.. > > > One issue with the above proposed change to use TP_STATUS_IN_PROGRESS > > is that the documentation of the tp_status field is somewhat > > inconsistent. In some places it's described as TP_STATUS_KERNEL(0) > > meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0) > > meaning the entry is owned by user space. In other places ownership > > by user space is defined by the TP_STATUS_USER(1) bit being set. > > But indeed this example in packet_mmap.txt is problematic > > if (status == TP_STATUS_KERNEL) > retval = poll(&pfd, 1, timeout); > > It does not really matter whether the docs are possibly inconsistent and > which one is authoritative. Examples like the above make it likely that > some user code expects such code to work. Yes, that's exactly my concern. Yet another troubling example seems to be lipbcap which also is looking specifically for status to be anything other than TP_STATUS_KERNEL(0) to indicate a frame is available in user space. Either way things are broken. They are broken as they stand now because the ring can get overrun and the kernel and user space tracking of the ring can get out of sync. And they are broken with the below change because some user space applications will be looking for anything other than TP_STATUS_KERNEL, so again the ring will get out of sync. The difference here being that the way it is today is on average (across all environments and across all user space apps) less likely to occur while with the change below it is much more likely to occur. Maybe the right answer here is to implement a fix that is compatible for existing applications and accept any potential performance impacts and then add yet another version (TPACKET_V4?) which more strictly requires the TP_STATUS_USER bit for passing ownership. > > > +++ b/net/packet/af_packet.c > > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, > > if (po->stats.stats1.tp_drops) > > status |= TP_STATUS_LOSING; > > } > > + > > + /* > > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other > > + * kernel threads from re-using this same entry. > > + */ > > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING > > No need to reinterpret existing flags. tp_status is a u32 with > sufficient undefined bits. Agreed. > > > + if (po->tp_version <= TPACKET_V2) > > + __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS); > > + > > po->stats.stats1.tp_packets++; > > if (copy_skb) { > > status |= TP_STATUS_COPY; > > -- > > 2.10.3.dirty > > Thanks for the feedback! Jon.
>> > One issue with the above proposed change to use TP_STATUS_IN_PROGRESS >> > is that the documentation of the tp_status field is somewhat >> > inconsistent. In some places it's described as TP_STATUS_KERNEL(0) >> > meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0) >> > meaning the entry is owned by user space. In other places ownership >> > by user space is defined by the TP_STATUS_USER(1) bit being set. >> >> But indeed this example in packet_mmap.txt is problematic >> >> if (status == TP_STATUS_KERNEL) >> retval = poll(&pfd, 1, timeout); >> >> It does not really matter whether the docs are possibly inconsistent and >> which one is authoritative. Examples like the above make it likely that >> some user code expects such code to work. > > Yes, that's exactly my concern. Yet another troubling example seems to be > lipbcap which also is looking specifically for status to be anything other than > TP_STATUS_KERNEL(0) to indicate a frame is available in user space. Good catch. If pcap-linux.c relies on this then the status field cannot be changed. Other fields can be modified freely while tp_status remains 0, perhaps that's an option.
> >> > One issue with the above proposed change to use TP_STATUS_IN_PROGRESS > >> > is that the documentation of the tp_status field is somewhat > >> > inconsistent. In some places it's described as TP_STATUS_KERNEL(0) > >> > meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0) > >> > meaning the entry is owned by user space. In other places ownership > >> > by user space is defined by the TP_STATUS_USER(1) bit being set. > >> > >> But indeed this example in packet_mmap.txt is problematic > >> > >> if (status == TP_STATUS_KERNEL) > >> retval = poll(&pfd, 1, timeout); > >> > >> It does not really matter whether the docs are possibly inconsistent and > >> which one is authoritative. Examples like the above make it likely that > >> some user code expects such code to work. > > > > Yes, that's exactly my concern. Yet another troubling example seems to be > > lipbcap which also is looking specifically for status to be anything other than > > TP_STATUS_KERNEL(0) to indicate a frame is available in user space. > > Good catch. If pcap-linux.c relies on this then the status field > cannot be changed. Other fields can be modified freely while tp_status > remains 0, perhaps that's an option. Possibly. Someone else suggested something similar but in at least the one example we thought through it still seemed like it didn't address the problem. For example, let's say we used tp_len == -1 to indicate to other kernel threads that the entry was already in progress. This would require that user space never set tp_len = -1 before returning the entry back to the kernel. If it did then no kernel thread would ever claim ownership and the ring would hang. Now, it seems pretty unlikely that user space would do such a thing so maybe we could look past that, but then we run into the issue that there is still a window of opportunity for other kernel threads to come in and wrap the ring. The reason is we can't set tp_len to the correct length after setting tp_status because user space could grab the entry and see tp_len == -1 so we have to set tp_len before we set tp_status. This means that there is still a window where other kernel threads could come in and see tp_len as something other than -1 and a tp_status of TP_STATUS_KERNEL and think it's ok to allocate the entry. This puts us back to where we are today (arguably with a smaller window, but a window none the less). Alternatively we could reacquire the spin_lock to then set tp_len followed by tp_status. This would give the necessary indivisibility in the kernel while preserving proper order as made visible to user space, but it comes at the cost of another spin_lock. Thanks for the suggestion. If you can think of a way around this I'm all ears. I'll think on this some more but so far I'm stuck on how to get past having to broaden the scope of the spin_lock, reacquire the spin_lock, or use some sort of atomic construct along with a parallel shadow ring structure (still thinking through that one as well).
Forward link to a new proposed patch at: https://www.mail-archive.com/netdev@vger.kernel.org/msg236629.html
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index e0f3f4a..264d7b2 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, if (po->stats.stats1.tp_drops) status |= TP_STATUS_LOSING; } + + /* + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other + * kernel threads from re-using this same entry. + */ +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING + if (po->tp_version <= TPACKET_V2) + __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS); + po->stats.stats1.tp_packets++; if (copy_skb) { status |= TP_STATUS_COPY;
Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which casues the ring to get corrupted by allowing multiple kernel threads to claim ownership of the same ring entry, Mark the ring entry as already being used within the spin_lock to prevent other kernel threads from reusing the same entry before it's fully filled in, passed to user space, and then eventually passed back to the kernel for use with a new packet. Note that the proposed change may modify the semantics of the interface between kernel space and user space in a way which may cause some applications to no longer work properly. More discussion on this change can be found in the additional comments section titled "3. Discussion on packet_mmap ownership semantics:". Signed-off-by: Jon Rosen <jrosen@cisco.com> --- Additional Comments Section --------------------------- 1. Description of the diffs: ---------------------------- TPACKET_V1 and TPACKET_V2 format rings: ------------------------------------------- Mark each entry as TP_STATUS_IN_PROGRESS after allocating to prevent other kernel threads from re-using the same entry. This is necessary because there may be a delay from the time the spin_lock is released to the time that the packet is completed and the corresponding ring entry is marked as owned by user space. If during this time other kernel threads enqueue more packets to the ring than the size of the ring then it will cause mutliple kernel threads to operate on the same entry at the same time, corrupting packets and the ring state. By marking the entry as allocated (IN_PROGRESS) we prevent other kernel threads from incorrectly re-using an entry that is still in the progress of being filled in before it is passed to user space. This forces each entry through the following states: +-> 1. (tp_status == TP_STATUS_KERNEL) | Free: For use by any kernel thread to store a new packet | | 2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER) | Allocated: In use by a *specific* kernel thread | | 3. (tp_status & TP_STATUS_USER) | Available: Packet available for user space to process | +-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL No impact on TPACKET_V3 format rings: ------------------------------------- Packet entry ownership is already protected from other kernel threads potentially re-using the same entry. This is done inside packet_current_rx_frame() where storage is allocated for the current packet. Since this is done within the spin_lock no additional status updates for this entry are required. Defining TP_STATUS_IN_PROGRESS: ------------------------------- Rather than defining a new-bit we re-use an existing bit for this intermediate state. Status will eventually be overwritten with the actual true status when passed to user space. Any bit used to pass information to user space other than the one that passes ownership is suitable (can't use TP_STATUS_USER). Alternatively a new bit could be defined. 2. More detailed discussion: ---------------------------- Ring entries basically have 2 states, owned by the kernel or owned by user space. For single producer/single consumer this works fine. For multiple producers there is a window between the call to spin_unlock [F] and the call to __packet_set_status [J] where if there are enough packets added to the ring by other kernel threads then the ring can wrap and multiple threads will end up using the same ring entries. This occurs because the ring entry alocated at [C] did not modify the state of the entry so it continues to appear as owned by the kernel and available for use for new packets even though it has already been allocated. A simple fix is to temporarily mark the ring entries within the spin lock such that user space will still think it?s owned by the kernel and other kernel threads will not see it as available to be used for new packets. If a kernel thread gets delayed between [F] and [J] for an extended period of time and the ring wraps back to the same point then subsiquent kernel threads attempts to allocate will fail and be treated as the ring being full. The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit to prevent other kernel threads from re-using the same entry. Note that any existing bit other than TP_STATUS_USER could have been used. af_packet.c:tpacket_rcv() ... code removed for brevity ... // Acquire spin lock A: spin_lock(&sk->sk_receive_queue.lock); // Preemption is disabled // Get current ring entry B: h.raw = packet_current_rx_frame( po, skb, TP_STATUS_KERNEL, (macoff+snaplen)); // Get out if ring is full // Code not show but it will also release lock if (!h.raw) goto ring_is_full; // Allocate ring entry C: packet_increment_rx_head(po, &po->rx_ring); // Protect against allocation overrun (the simple fix) // by changing TP_STATUS_KERNEL to TP_STATUS_IN_PROGRESS D: if (po->tp_version <= TPACKET_V2) __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS); // Update counter E: po->stats.stats1.tp_packets++; // Release spin lock F: spin_unlock(&sk->sk_receive_queue.lock); // Copy packet payload G: skb_copy_bits(skb, 0, h.raw + macoff, snaplen); // Get current timestamp H: if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp))) getnstimeofday(&ts); status |= ts_status; // Fill in header portions of ring entry (removed a bunch of // writes for brevity) ... h.h1->tp_sec = ts.tv_sec; h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC; sll->sll_halen = dev_parse_header(skb, sll->sll_addr); ... // Memory Barrier to make sure all data is written before // passing ownership to user space I: smp_mb(); // Update Status, passing ownership to user space J: __packet_set_status(po, h.raw, status | TP_STATUS_USER); 3. Discussion on packet_mmap ownership semantics: ------------------------------------------------- One issue with the above proposed change to use TP_STATUS_IN_PROGRESS is that the documentation of the tp_status field is somewhat inconsistent. In some places it's described as TP_STATUS_KERNEL(0) meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0) meaning the entry is owned by user space. In other places ownership by user space is defined by the TP_STATUS_USER(1) bit being set. Relevant section of packet_mmap documentation from https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt follow but a summary of the semantics in question are described as: - At the beginning of each frame there is an status field - If this field is 0 means that the frame is ready to be used for the kernel - If not, there is a frame the user can read This would indicate that 0 means owned by the kernel and non-0 is owned by the user. The simple fix above is to set the status to something that neither the kernel nor the user thinks they own. That means that 0 vs. non-0 would be insufficient. The description and examples in packet_mmap.txt actually talk about using a specific bit, the TP_STATUS_USER bit, to indicate something is owned by the kernel vs something is owned by the user. The relevant references from packet_mmap.txt to this bit are: - The kernel initializes all frames to TP_STATUS_KERNEL(0) - when the kernel receives a packet it puts in the buffer and updates the status with at least the TP_STATUS_USER(1) flag This description contradicts the previous description which implied that non-0 means owned by the user. In the above statement it implies that there needs to be more than just non-0 and that the TP_STATUS_USER bit must be set for it to be owned by user space. If the above proposed fix of using a new TP_STATUS_IN_PROGRESS bit (or any existing bit other than TP_STATUS_USER) were to be used and an application were written assuming !TP_STATUS_KERNEL implies owned by user space then the application would incorrectly assume there was a valid packet entry to be processed when the entry is not ready. If an application were instead written to look specificaly for TP_STATUS_USER then the above proposed fix would work correctly. A more complex solution might be to create a shadow data structure which is private to the kernel and keeps status bits for each ring entry to indicate the intermediate state between owned by kernel and owned by user space. 4. Snipet from packet_mmap.txt ------------------------------ At the beginning of each frame there is an status field (see struct tpacket_hdr). If this field is 0 means that the frame is ready to be used for the kernel, If not, there is a frame the user can read and the following flags apply: +++ Capture process: from include/linux/if_packet.h #define TP_STATUS_COPY (1 << 1) #define TP_STATUS_LOSING (1 << 2) #define TP_STATUS_CSUMNOTREADY (1 << 3) #define TP_STATUS_CSUM_VALID (1 << 7) TP_STATUS_COPY : This flag indicates that the frame (and associated meta information) has been truncated because it's larger than tp_frame_size. This packet can be read entirely with recvfrom(). In order to make this work it must to be enabled previously with setsockopt() and the PACKET_COPY_THRESH option. The number of frames that can be buffered to be read with recvfrom is limited like a normal socket. See the SO_RCVBUF option in the socket (7) man page. TP_STATUS_LOSING : indicates there were packet drops from last time statistics where checked with getsockopt() and the PACKET_STATISTICS option. TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which its checksum will be done in hardware. So while reading the packet we should not try to check the checksum. TP_STATUS_CSUM_VALID : This flag indicates that at least the transport header checksum of the packet has been already validated on the kernel side. If the flag is not set then we are free to check the checksum by ourselves provided that TP_STATUS_CSUMNOTREADY is also not set. for convenience there are also the following defines: #define TP_STATUS_KERNEL 0 #define TP_STATUS_USER 1 The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel receives a packet it puts in the buffer and updates the status with at least the TP_STATUS_USER flag. Then the user can read the packet, once the packet is read the user must zero the status field, so the kernel can use again that frame buffer. The user can use poll (any other variant should apply too) to check if new packets are in the ring: struct pollfd pfd; pfd.fd = fd; pfd.revents = 0; pfd.events = POLLIN|POLLRDNORM|POLLERR; if (status == TP_STATUS_KERNEL) retval = poll(&pfd, 1, timeout); It doesn't incur in a race condition to first check the status value and then poll for frames. 5. Snipet from man pages for packet(7): --------------------------------------- See portion marked with "***" for reference to use of TP_STATUS_USER bit. PACKET_RX_RING Create a memory-mapped ring buffer for asynchronous packet reception. The packet socket reserves a contiguous region of application address space, lays it out into an array of packet slots and copies packets (up to tp_snaplen) into subsequent slots. Each packet is preceded by a metadata structure similar to tpacket_auxdata. The protocol fields encode the offset to the data from the start of the metadata header. tp_net stores the offset to the network layer. If the packet socket is of type SOCK_DGRAM, then tp_mac is the same. If it is of type SOCK_RAW, then that field stores the offset to the link-layer frame. Packet socket and application communi? cate the head and tail of the ring through the tp_status field. The packet socket owns all slots with tp_status equal to TP_STATUS_KERNEL. After filling a slot, it changes the status of the slot to *** transfer ownership to the application. During normal *** operation, the new tp_status value has at least the *** TP_STATUS_USER bit set to signal that a received packet *** has been stored. When the application has finished processing a packet, it transfers ownership of the slot back to the socket by setting tp_status equal to TP_STATUS_KERNEL. Packet sockets implement multiple variants of the packet ring. The implementation details are described in Documentation/networking/packet_mmap.txt in the Linux kernel source tree. 6. Relevant files: ------------------ net/packet/af_packet.c include/uapi/linux/if_packet.h Documentation/networking/packet_mmap.txt Jon Rosen (1): packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun net/packet/af_packet.c | 9 +++++++++ 1 file changed, 9 insertions(+) -- 2.10.3.dirty