Message ID: 20180103223827.39601-1-mark.rutland@arm.com
Series: API for inhibiting speculative arbitrary read primitives
On Wed, Jan 3, 2018 at 4:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> The 'if_nospec' primitive marks locations where the kernel is disabling
> speculative execution that could potentially access privileged data. It
> is expected to be paired with a 'nospec_{ptr,load}' where the user
> controlled value is actually consumed.

I'm much less worried about these "nospec_load/if" macros, than I am
about having a sane way to determine when they should be needed.

Is there such a sane model right now, or are we talking "people will
randomly add these based on strong feelings"?

            Linus
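For reference, the pairing being discussed looks roughly like the sketch below. The if_nospec()/nospec_load() names are taken from the RFC quoted above, and the lfence lowering of the barrier comes from the 'osb()' description later in this thread; the helper itself and its exact signatures are illustrative assumptions, not the posted implementation:

	/* x86 flavour of the speculation barrier, per the osb() description below */
	#define osb()	asm volatile("lfence" ::: "memory")

	/* 'idx' is user-controlled; 'size' is the trusted bound */
	static unsigned char bounded_read(const unsigned char *array,
					  unsigned long idx, unsigned long size)
	{
		unsigned char val = 0;

		if (idx < size) {		/* the if_nospec() site */
			osb();			/* no dependent load issues speculatively past here */
			val = array[idx];	/* the nospec_load() site */
		}
		return val;
	}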
On Wed, 3 Jan 2018 16:39:31 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Jan 3, 2018 at 4:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: > > The 'if_nospec' primitive marks locations where the kernel is disabling > > speculative execution that could potentially access privileged data. It > > is expected to be paired with a 'nospec_{ptr,load}' where the user > > controlled value is actually consumed. > > I'm much less worried about these "nospec_load/if" macros, than I am > about having a sane way to determine when they should be needed. > > Is there such a sane model right now, or are we talking "people will > randomly add these based on strong feelings"? There are people trying to tune coverity and other tool rules to identify cases, and some of the work so far was done that way. For x86 we didn't find too many so far so either the needed pattern is uncommon or .... 8) Given you can execute over a hundred basic instructions in a speculation window it does need to be a tool that can explore not just in function but across functions. That's really tough for the compiler itself to do without help. What remains to be seen is if there are other patterns that affect different processors. In the longer term the compiler itself needs to know what is and isn't safe (ie you need to be able to write things like void foo(tainted __user int *x) and have the compiler figure out what level of speculation it can do and (on processors with those features like IA64) when it can and can't do various kinds of non-trapping loads. Alan
[ adding Julia and Dan ] On Wed, Jan 3, 2018 at 5:07 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote: > On Wed, 3 Jan 2018 16:39:31 -0800 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Wed, Jan 3, 2018 at 4:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: >> > The 'if_nospec' primitive marks locations where the kernel is disabling >> > speculative execution that could potentially access privileged data. It >> > is expected to be paired with a 'nospec_{ptr,load}' where the user >> > controlled value is actually consumed. >> >> I'm much less worried about these "nospec_load/if" macros, than I am >> about having a sane way to determine when they should be needed. >> >> Is there such a sane model right now, or are we talking "people will >> randomly add these based on strong feelings"? > > There are people trying to tune coverity and other tool rules to identify > cases, and some of the work so far was done that way. For x86 we didn't > find too many so far so either the needed pattern is uncommon or .... 8) > > Given you can execute over a hundred basic instructions in a speculation > window it does need to be a tool that can explore not just in function > but across functions. That's really tough for the compiler itself to do > without help. > > What remains to be seen is if there are other patterns that affect > different processors. > > In the longer term the compiler itself needs to know what is and isn't > safe (ie you need to be able to write things like > > void foo(tainted __user int *x) > > and have the compiler figure out what level of speculation it can do and > (on processors with those features like IA64) when it can and can't do > various kinds of non-trapping loads. > It would be great if coccinelle and/or smatch could be taught to catch some of these case at least as a first pass "please audit this code block" type of notification.
On Thu, 4 Jan 2018, Alan Cox wrote:

> There are people trying to tune coverity and other tool rules to identify
> cases,

Yeah, but that (and *especially* Coverity) is so inconvenient to be
applied to each and every patch ... that this is not the way to go.

If the CPU speculation can cause these kinds of side-effects, it just must
not happen, full stop. OS trying to work it around is just a whack-a-mole
(which we can apply for old silicon, sure ... but not something that
becomes a new standard) that's never going to lead to any ultimate
solution.

--
Jiri Kosina
SUSE Labs
On Thu, 4 Jan 2018 02:27:54 +0100 (CET)
Jiri Kosina <jikos@kernel.org> wrote:

> On Thu, 4 Jan 2018, Alan Cox wrote:
>
> > There are people trying to tune coverity and other tool rules to identify
> > cases,
>
> Yeah, but that (and *especially* Coverity) is so inconvenient to be
> applied to each and every patch ... that this is not the way to go.

Agreed entirely - and Coverity is non-free, which is even worse as a
dependency. Right now we are in the 'what could be done quickly by a few
people' space. The papers are now published, so the world can work on
better solutions and extending more convenient tooling.

> If the CPU speculation can cause these kinds of side-effects, it just must
> not happen, full stop.

At which point your performance will resemble that of a 2012 atom
processor at best.

> OS trying to work it around is just a whack-a-mole
> (which we can apply for old silicon, sure ... but not something that
> becomes a new standard) that's never going to lead to any ultimate
> solution.

In the ideal world it becomes possible for future processors to resolve
such markings down to no-ops. Will that be possible, or will we get more
explicit ways to tell the processor what is unsafe? I don't personally
know, but I do know that turning off speculation is not the answer.

Clearly, if the CPU must be told, then C is going to have to grow some
syntax for it, and some other languages are going to see 'taint' moving
from a purely software construct to a real processor one.

Alan
On Thu, 4 Jan 2018, Alan Cox wrote:

> > If the CPU speculation can cause these kinds of side-effects, it just must
> > not happen, full stop.
>
> At which point your performance will resemble that of a 2012 atom
> processor at best.

You know what? I'd be completely fine with that, if it's traded for "my
ssh and internet banking keys are JUST MINE, ok?" :)

> Clearly if the CPU must be told then C is going to have to grow some
> syntax for it and some other languages are going to see 'taint' moving
> from a purely software construct to a real processor one.

Again, what is the problem with the "yeah, it turned out to speed things
up in the past, but hey, it has security issues, so we stopped doing it"
approach?

Thanks,

--
Jiri Kosina
SUSE Labs
On Wed, Jan 3, 2018 at 5:27 PM, Jiri Kosina <jikos@kernel.org> wrote:
> On Thu, 4 Jan 2018, Alan Cox wrote:
>
>> There are people trying to tune coverity and other tool rules to identify
>> cases,
>
> Yeah, but that (and *especially* Coverity) is so inconvenient to be
> applied to each and every patch ... that this is not the way to go.
>
> If the CPU speculation can cause these kinds of side-effects, it just must
> not happen, full stop. OS trying to work it around is just a whack-a-mole
> (which we can apply for old silicon, sure ... but not something that
> becomes a new standard) that's never going to lead to any ultimate
> solution.

Speaking from a purely Linux kernel maintenance process perspective, we
play whack-a-mole with missed endian conversions and other bugs that
coccinelle, sparse, etc. help us catch. So this is in that same category,
but yes, it's inconvenient.

Alan already mentioned potential compiler changes and annotating
variables so that the compiler can help in the future, but until then we
need these explicit checks.

Elena has done the work of auditing static analysis reports to a dozen
or so locations that need some 'nospec' handling. The point of this RFC
is to level set across architectures on the macros/names/mechanisms that
should be used for the first round of fixes.
On Wed, Jan 3, 2018 at 5:51 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> Elena has done the work of auditing static analysis reports to a dozen
> or so locations that need some 'nospec' handling.

I'd love to see that patch, just to see how bad things look.

Because I think that really is very relevant to the interface too.

If we're talking "a dozen locations" that are fairly well constrained,
that's very different from having thousands all over the place.

            Linus
On Wed, 3 Jan 2018, Dan Williams wrote:

> Speaking from a purely Linux kernel maintenance process perspective we
> play whack-a-mole with missed endian conversions and other bugs that
> coccinelle, sparse, etc help us catch.

Fully agreed.

> So this is in that same category, but yes, it's inconvenient.

Disagreed, violently. The CPU has to execute the instructions I ask it to
execute, and if it executes *anything* else that reveals any information
about the instructions that have *not* been executed, it's flawed.

> Elena has done the work of auditing static analysis reports to a dozen
> or so locations that need some 'nospec' handling.

How exactly is that related (especially in longer-term support terms) to
BPF anyway?

Thanks,

--
Jiri Kosina
SUSE Labs
> Disagreed, violently. CPU has to execute the instructions I ask it to > execute, and if it executes *anything* else that reveals any information > about the instructions that have *not* been executed, it's flawed. Then stick to in order processors. Just don't be in a hurry to get your computation finished. > > Elena has done the work of auditing static analysis reports to a dozen > > or so locations that need some 'nospec' handling. > > How exactly is that related (especially in longer-term support terms) to > BPF anyway? If you read the papers you need a very specific construct in order to not only cause a speculative load of an address you choose but also to then manage to cause a second operation that in some way reveals bits of data or allows you to ask questions. BPF allows you to construct those sequences relatively easily and it's the one case where a user space application can fairly easily place code it wants to execute in the kernel. Without BPF you have to find the right construct in the kernel, prime all the right predictions and measure the result without getting killed off. There are places you can do that but they are not so easy and we don't (at this point) think there are that many. The same situation occurs in user space with interpreters and JITs,hence the paper talking about javascript. Any JIT with the ability to do timing is particularly vulnerable to versions of this specific attack because the attacker gets to create the code pattern rather than have to find it. Alan
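The construct Alan is describing has the shape given in the Spectre paper: a bounds check that can be mispredicted, a first load at an attacker-chosen out-of-bounds index, and a second access whose address depends on the loaded value, leaving a cache footprint that can be timed from user space. A minimal sketch (array names and probe stride are illustrative, not from any kernel code path):

	unsigned char array1[16];		/* bounds-checked array */
	unsigned char array2[256 * 512];	/* probe array, one region per possible byte value */
	volatile unsigned char sink;

	void victim_function(unsigned long x, unsigned long array1_size)
	{
		if (x < array1_size)			/* branch may be mispredicted */
			sink = array2[array1[x] * 512];	/* cache footprint depends on array1[x] */
	}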
On Wed, 2018-01-03 at 17:54 -0800, Linus Torvalds wrote: > On Wed, Jan 3, 2018 at 5:51 PM, Dan Williams <dan.j.williams@intel.co > m> wrote: > > > > Elena has done the work of auditing static analysis reports to a > > dozen > > or so locations that need some 'nospec' handling. > > I'd love to see that patch, just to see how bad things look. > > Because I think that really is very relevant to the interface too. > > If we're talking "a dozen locations" that are fairly well > constrained, > that's very different from having thousands all over the place. > Note, the following is Elena's work, I'm just helping poke the upstream discussion along while she's offline. Elena audited the static analysis reports down to the following locations where userspace provides a value for a conditional branch and then later creates a dependent load on that same userspace controlled value. 'osb()', observable memory barrier, resolves to an lfence on x86. My concern with these changes is that it is not clear what content within the branch block is of concern. Peter's 'if_nospec' proposal combined with Mark's 'nospec_load' would seem to clear that up, from a self documenting code perspective, and let archs optionally protect entire conditional blocks or individual loads within those blocks. Note that these are "a human looked at static analysis reports and could not rationalize that these are false positives". Specific domain knowledge about these paths may find that some of them are indeed false positives. The change to m_start in kernel/user_namespace.c is interesting because that's an example where the nospec_load() approach by itself would need to barrier speculation twice whereas if_nospec can do it once for the whole block. [ forgive any whitespace damage ] 8<--- diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c index 3e7e283a44a8..65175bbe805f 100644 --- a/drivers/media/usb/uvc/uvc_v4l2.c +++ b/drivers/media/usb/uvc/uvc_v4l2.c @@ -821,6 +821,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh, } pin = iterm->id; } else if (index < selector->bNrInPins) { + osb(); pin = selector->baSourceID[index]; list_for_each_entry(iterm, &chain->entities, chain) { if (!UVC_ENTITY_IS_ITERM(iterm)) diff --git a/drivers/net/wireless/ath/carl9170/main.c b/drivers/net/wireless/ath/carl9170/main.c index 988c8857d78c..cf267b709af6 100644 --- a/drivers/net/wireless/ath/carl9170/main.c +++ b/drivers/net/wireless/ath/carl9170/main.c @@ -1388,6 +1388,7 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw, mutex_lock(&ar->mutex); if (queue < ar->hw->queues) { + osb(); memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param)); ret = carl9170_set_qos(ar); } else { diff --git a/drivers/net/wireless/intersil/p54/main.c b/drivers/net/wireless/intersil/p54/main.c index ab6d39e12069..f024f1ad81ff 100644 --- a/drivers/net/wireless/intersil/p54/main.c +++ b/drivers/net/wireless/intersil/p54/main.c @@ -415,6 +415,7 @@ static int p54_conf_tx(struct ieee80211_hw *dev, mutex_lock(&priv->conf_mutex); if (queue < dev->queues) { + osb(); P54_SET_QUEUE(priv->qos_params[queue], params->aifs, params->cw_min, params->cw_max, params->txop); ret = p54_set_edcf(priv); diff --git a/drivers/net/wireless/st/cw1200/sta.c b/drivers/net/wireless/st/cw1200/sta.c index 38678e9a0562..b4a2a7ba04e8 100644 --- a/drivers/net/wireless/st/cw1200/sta.c +++ b/drivers/net/wireless/st/cw1200/sta.c @@ -619,6 +619,7 @@ int cw1200_conf_tx(struct ieee80211_hw *dev, struct ieee80211_vif *vif, mutex_lock(&priv->conf_mutex); if 
(queue < dev->queues) { + osb(); old_uapsd_flags = le16_to_cpu(priv->uapsd_info.uapsd_flags); WSM_TX_QUEUE_SET(&priv->tx_queue_params, queue, 0, 0, 0); diff --git a/drivers/scsi/qla2xxx/qla_mr.c b/drivers/scsi/qla2xxx/qla_mr.c index d5da3981cefe..dec8b6e087e3 100644 --- a/drivers/scsi/qla2xxx/qla_mr.c +++ b/drivers/scsi/qla2xxx/qla_mr.c @@ -2304,10 +2304,12 @@ qlafx00_status_entry(scsi_qla_host_t *vha, struct rsp_que *rsp, void *pkt) req = ha->req_q_map[que]; /* Validate handle. */ - if (handle < req->num_outstanding_cmds) + if (handle < req->num_outstanding_cmds) { + osb(); sp = req->outstanding_cmds[handle]; - else + } else { sp = NULL; + } if (sp == NULL) { ql_dbg(ql_dbg_io, vha, 0x3034, @@ -2655,10 +2657,12 @@ qlafx00_multistatus_entry(struct scsi_qla_host *vha, req = ha->req_q_map[que]; /* Validate handle. */ - if (handle < req->num_outstanding_cmds) + if (handle < req->num_outstanding_cmds) { + osb(); sp = req->outstanding_cmds[handle]; - else + } else { sp = NULL; + } if (sp == NULL) { ql_dbg(ql_dbg_io, vha, 0x3044, diff --git a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c index 145a5c53ff5c..d732b34e120d 100644 --- a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c +++ b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c @@ -57,15 +57,16 @@ static int int340x_thermal_get_trip_temp(struct thermal_zone_device *zone, if (d->override_ops && d->override_ops->get_trip_temp) return d->override_ops->get_trip_temp(zone, trip, temp); - if (trip < d->aux_trip_nr) + if (trip < d->aux_trip_nr) { + osb(); *temp = d->aux_trips[trip]; - else if (trip == d->crt_trip_id) + } else if (trip == d->crt_trip_id) { *temp = d->crt_temp; - else if (trip == d->psv_trip_id) + } else if (trip == d->psv_trip_id) { *temp = d->psv_temp; - else if (trip == d->hot_trip_id) + } else if (trip == d->hot_trip_id) { *temp = d->hot_temp; - else { + } else { for (i = 0; i < INT340X_THERMAL_MAX_ACT_TRIP_COUNT; i++) { if (d->act_trips[i].valid && d->act_trips[i].id == trip) { diff --git a/fs/udf/misc.c b/fs/udf/misc.c index 401e64cde1be..c5394760d26b 100644 --- a/fs/udf/misc.c +++ b/fs/udf/misc.c @@ -104,6 +104,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size, iinfo->i_lenEAttr) { uint32_t aal = le32_to_cpu(eahd->appAttrLocation); + + osb(); memmove(&ea[offset - aal + size], &ea[aal], offset - aal); offset -= aal; @@ -114,6 +116,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size, iinfo->i_lenEAttr) { uint32_t ial = le32_to_cpu(eahd->impAttrLocation); + + osb(); memmove(&ea[offset - ial + size], &ea[ial], offset - ial); offset -= ial; @@ -125,6 +129,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size, iinfo->i_lenEAttr) { uint32_t aal = le32_to_cpu(eahd->appAttrLocation); + + osb(); memmove(&ea[offset - aal + size], &ea[aal], offset - aal); offset -= aal; diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index 1c65817673db..dbc12007da51 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i { struct fdtable *fdt = rcu_dereference_raw(files->fdt); - if (fd < fdt->max_fds) + if (fd < fdt->max_fds) { + osb(); return rcu_dereference_raw(fdt->fd[fd]); + } return NULL; } diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 246d4d4ce5c7..aa0be8cef2d4 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ 
-648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos, { loff_t pos = *ppos; unsigned extents = map->nr_extents; - smp_rmb(); - if (pos >= extents) - return NULL; + /* paired with smp_wmb in map_write */ + smp_rmb(); - if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) - return &map->extent[pos]; + if (pos < extents) { + osb(); + if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) + return &map->extent[pos]; + return &map->forward[pos]; + } - return &map->forward[pos]; + return NULL; } static void *uid_m_start(struct seq_file *seq, loff_t *ppos) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 125c1eab3eaa..56909c8fa025 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -476,6 +476,7 @@ static int raw_getfrag(void *from, char *to, int offset, int len, int odd, if (offset < rfv->hlen) { int copy = min(rfv->hlen - offset, len); + osb(); if (skb->ip_summed == CHECKSUM_PARTIAL) memcpy(to, rfv->hdr.c + offset, copy); else diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index 761a473a07c5..0942784f3f8d 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -729,6 +729,7 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd, if (offset < rfv->hlen) { int copy = min(rfv->hlen - offset, len); + osb(); if (skb->ip_summed == CHECKSUM_PARTIAL) memcpy(to, rfv->c + offset, copy); else diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 8ca9915befc8..7f83abdea255 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index) if (index < net->mpls.platform_labels) { struct mpls_route __rcu **platform_label = rcu_dereference(net->mpls.platform_label); + + osb(); rt = rcu_dereference(platform_label[index]); } return rt;
On Thu, Jan 04, 2018 at 02:15:53AM +0000, Alan Cox wrote:
> > > > Elena has done the work of auditing static analysis reports to a dozen
> > > > or so locations that need some 'nospec' handling.
> > >
> > > How exactly is that related (especially in longer-term support terms) to
> > > BPF anyway?
> >
> > If you read the papers you need a very specific construct in order to not
> > only cause a speculative load of an address you choose but also to then
> > manage to cause a second operation that in some way reveals bits of data
> > or allows you to ask questions.
> >
> > BPF allows you to construct those sequences relatively easily and it's
> > the one case where a user space application can fairly easily place code
> > it wants to execute in the kernel. Without BPF you have to find the right
> > construct in the kernel, prime all the right predictions and measure the
> > result without getting killed off. There are places you can do that but
> > they are not so easy and we don't (at this point) think there are that
> > many.

For BPF in particular we're thinking of doing a different fix. Instead of
killing speculation we can let the CPU speculate. The fix will include
rounding up bpf maps to the nearest power of 2 and inserting a bpf_and
operation on the index after the bounds check, so the CPU can continue to
speculate beyond the bounds check, but it will load from zeroed memory.
So this nospec arch-dependent hack won't be necessary on the bpf side,
but it may still be needed in other parts of the kernel.

Also note that variant 1 is talking about exploiting the prog_array bpf
feature, which had 64-bit access prior to commit 90caccdd8cc0 ("bpf: fix
bpf_tail_call() x64 JIT"). That was a fix for the JIT and not related to
this cpu issue, but I believe it breaks the existing exploit. Since it's
not clear whether it's still possible to use bpf with 32-bit speculation
only, we're going to do this rounding fix for unpriv just in case.
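A sketch of the map-rounding approach Alexei describes: the element array is sized (and zeroed) up to the nearest power of two, and the index is ANDed with the corresponding mask after the bounds check, so even a mispredicted branch can only speculatively load from within the map's own storage. The structure and names below are illustrative, not the actual bpf map layout:

	struct example_map {
		unsigned int nr_entries;	/* user-visible number of slots */
		unsigned int index_mask;	/* (nearest power of two) - 1 */
		unsigned long entries[];	/* allocated and zeroed up to index_mask + 1 slots */
	};

	static unsigned long example_lookup(const struct example_map *map,
					    unsigned int idx)
	{
		if (idx >= map->nr_entries)
			return 0;
		/* speculation may continue, but the masked index stays inside the map */
		return map->entries[idx & map->index_mask];
	}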
On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote: > diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h > index 1c65817673db..dbc12007da51 100644 > --- a/include/linux/fdtable.h > +++ b/include/linux/fdtable.h > @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i > { > struct fdtable *fdt = rcu_dereference_raw(files->fdt); > > - if (fd < fdt->max_fds) > + if (fd < fdt->max_fds) { > + osb(); > return rcu_dereference_raw(fdt->fd[fd]); > + } > return NULL; > } ... and the point of that would be? Possibly revealing the value of files->fdt? Why would that be a threat, assuming you manage to extract the information in question in the first place?
"Williams, Dan J" <dan.j.williams@intel.com> writes: > Note that these are "a human looked at static analysis reports and > could not rationalize that these are false positives". Specific domain > knowledge about these paths may find that some of them are indeed false > positives. > > The change to m_start in kernel/user_namespace.c is interesting because > that's an example where the nospec_load() approach by itself would need > to barrier speculation twice whereas if_nospec can do it once for the > whole block. This user_namespace.c change is very convoluted for what it is trying to do. It simplifies to a one liner that just adds osb() after pos >= extents. AKA: if (pos >= extents) return NULL; + osb(); Is the intent to hide which branch branch we take based on extents, after the pos check? I suspect this implies that using a user namespace and a crafted uid map you can hit this in stat, on the fast path. At which point I suspect we will be better off extending struct user_namespace by a few pointers, so there is no union and remove the need for blocking speculation entirely. > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 246d4d4ce5c7..aa0be8cef2d4 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos, > { > loff_t pos = *ppos; > unsigned extents = map->nr_extents; > - smp_rmb(); > > - if (pos >= extents) > - return NULL; > + /* paired with smp_wmb in map_write */ > + smp_rmb(); > > - if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > - return &map->extent[pos]; > + if (pos < extents) { > + osb(); > + if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > + return &map->extent[pos]; > + return &map->forward[pos]; > + } > > - return &map->forward[pos]; > + return NULL; > } > > static void *uid_m_start(struct seq_file *seq, loff_t *ppos) > diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c > index 8ca9915befc8..7f83abdea255 100644 > --- a/net/mpls/af_mpls.c > +++ b/net/mpls/af_mpls.c > @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index) > if (index < net->mpls.platform_labels) { > struct mpls_route __rcu **platform_label = > rcu_dereference(net->mpls.platform_label); > + > + osb(); > rt = rcu_dereference(platform_label[index]); > } > return rt; Ouch! This adds a barrier in the middle of an rcu lookup, on the fast path for routing mpls packets. Which if memory serves will noticably slow down software processing of mpls packets. Why does osb() fall after the branch for validity? So that we allow speculation up until then? I suspect it would be better to have those barriers in the tun/tap interfaces where userspace can inject packets and thus time them. Then the code could still speculate and go fast for remote packets. Or does the speculation stomping have to be immediately at the place where we use data from userspace to perform a table lookup? Eric p.s. osb() seems to be mispelled. obs() would make much more sense.
On Wed, Jan 3, 2018 at 8:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote: > >> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h >> index 1c65817673db..dbc12007da51 100644 >> --- a/include/linux/fdtable.h >> +++ b/include/linux/fdtable.h >> @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i >> { >> struct fdtable *fdt = rcu_dereference_raw(files->fdt); >> >> - if (fd < fdt->max_fds) >> + if (fd < fdt->max_fds) { >> + osb(); >> return rcu_dereference_raw(fdt->fd[fd]); >> + } >> return NULL; >> } > > ... and the point of that would be? Possibly revealing the value of files->fdt? > Why would that be a threat, assuming you manage to extract the information in > question in the first place? No, the concern is that an fd value >= fdt->max_fds may cause the cpu to read arbitrary memory addresses relative to files->fdt and userspace can observe that it got loaded. With the barrier the speculation stops and never allows that speculative read to issue. With the change, the cpu only issues a read for fdt->fd[fd] when fd is valid.
On 01/03/2018 09:44 PM, Dan Williams wrote:
> No, the concern is that an fd value >= fdt->max_fds may cause the cpu
> to read arbitrary memory addresses relative to files->fdt and
> userspace can observe that it got loaded.

Yep, it potentially tells you something about memory after fdt->fd[].
For instance, you might be able to observe if some random bit of memory
after the actual fd[] array had 'mask' set because the CPU is running
this code with a 'file' that actually fails the "fd < fdt->max_fds"
check:

	file = __fcheck_files(files, fd);
	if (!file || unlikely(file->f_mode & mask))
		return 0;
	return (unsigned long)file;
On Wed, Jan 03, 2018 at 09:44:33PM -0800, Dan Williams wrote: > On Wed, Jan 3, 2018 at 8:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote: > > > >> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h > >> index 1c65817673db..dbc12007da51 100644 > >> --- a/include/linux/fdtable.h > >> +++ b/include/linux/fdtable.h > >> @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i > >> { > >> struct fdtable *fdt = rcu_dereference_raw(files->fdt); > >> > >> - if (fd < fdt->max_fds) > >> + if (fd < fdt->max_fds) { > >> + osb(); > >> return rcu_dereference_raw(fdt->fd[fd]); > >> + } > >> return NULL; > >> } > > > > ... and the point of that would be? Possibly revealing the value of files->fdt? > > Why would that be a threat, assuming you manage to extract the information in > > question in the first place? > > No, the concern is that an fd value >= fdt->max_fds may cause the cpu > to read arbitrary memory addresses relative to files->fdt and > userspace can observe that it got loaded. Yes. And all that might reveal is the value of files->fdt. Who cares?
On Thu, Jan 04, 2018 at 05:50:12AM +0000, Al Viro wrote: > On Wed, Jan 03, 2018 at 09:44:33PM -0800, Dan Williams wrote: > > On Wed, Jan 3, 2018 at 8:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote: > > > > > >> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h > > >> index 1c65817673db..dbc12007da51 100644 > > >> --- a/include/linux/fdtable.h > > >> +++ b/include/linux/fdtable.h > > >> @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i > > >> { > > >> struct fdtable *fdt = rcu_dereference_raw(files->fdt); > > >> > > >> - if (fd < fdt->max_fds) > > >> + if (fd < fdt->max_fds) { > > >> + osb(); > > >> return rcu_dereference_raw(fdt->fd[fd]); > > >> + } > > >> return NULL; > > >> } > > > > > > ... and the point of that would be? Possibly revealing the value of files->fdt? > > > Why would that be a threat, assuming you manage to extract the information in > > > question in the first place? > > > > No, the concern is that an fd value >= fdt->max_fds may cause the cpu > > to read arbitrary memory addresses relative to files->fdt and > > userspace can observe that it got loaded. > > Yes. And all that might reveal is the value of files->fdt. Who cares? Sorry, s/files->fdt/files->fdt->fd/. Still the same question - what information would that extract and how would attacker use that?
On Wed, 3 Jan 2018, Dan Williams wrote: > [ adding Julia and Dan ] > > On Wed, Jan 3, 2018 at 5:07 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote: > > On Wed, 3 Jan 2018 16:39:31 -0800 > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > >> On Wed, Jan 3, 2018 at 4:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: > >> > The 'if_nospec' primitive marks locations where the kernel is disabling > >> > speculative execution that could potentially access privileged data. It > >> > is expected to be paired with a 'nospec_{ptr,load}' where the user > >> > controlled value is actually consumed. > >> > >> I'm much less worried about these "nospec_load/if" macros, than I am > >> about having a sane way to determine when they should be needed. > >> > >> Is there such a sane model right now, or are we talking "people will > >> randomly add these based on strong feelings"? > > > > There are people trying to tune coverity and other tool rules to identify > > cases, and some of the work so far was done that way. For x86 we didn't > > find too many so far so either the needed pattern is uncommon or .... 8) > > > > Given you can execute over a hundred basic instructions in a speculation > > window it does need to be a tool that can explore not just in function > > but across functions. That's really tough for the compiler itself to do > > without help. > > > > What remains to be seen is if there are other patterns that affect > > different processors. > > > > In the longer term the compiler itself needs to know what is and isn't > > safe (ie you need to be able to write things like > > > > void foo(tainted __user int *x) > > > > and have the compiler figure out what level of speculation it can do and > > (on processors with those features like IA64) when it can and can't do > > various kinds of non-trapping loads. > > > > It would be great if coccinelle and/or smatch could be taught to catch > some of these case at least as a first pass "please audit this code > block" type of notification. > What should one be looking for. Do you have a typical example? julia
On Wed, Jan 3, 2018 at 9:01 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > "Williams, Dan J" <dan.j.williams@intel.com> writes: > > > >> Note that these are "a human looked at static analysis reports and >> could not rationalize that these are false positives". Specific domain >> knowledge about these paths may find that some of them are indeed false >> positives. >> >> The change to m_start in kernel/user_namespace.c is interesting because >> that's an example where the nospec_load() approach by itself would need >> to barrier speculation twice whereas if_nospec can do it once for the >> whole block. > > > This user_namespace.c change is very convoluted for what it is trying to > do. Sorry this was my rebase on top of commit d5e7b3c5f51f "userns: Don't read extents twice in m_start" the original change from Elena was simpler. Part of the complexity arises from converting the common kernel pattern of if (<invalid condition>) return NULL; do_stuff; ...to: if (<valid conidtion>) { barrier(); do_stuff; } > It simplifies to a one liner that just adds osb() after pos >= > extents. AKA: > > if (pos >= extents) > return NULL; > + osb(); > > Is the intent to hide which branch branch we take based on extents, > after the pos check? The intent is to prevent speculative execution from triggering any reads when 'pos' is invalid. > I suspect this implies that using a user namespace and a crafted uid > map you can hit this in stat, on the fast path. > > At which point I suspect we will be better off extending struct > user_namespace by a few pointers, so there is no union and remove the > need for blocking speculation entirely. How does this help prevent a speculative read with an invalid 'pos' reading arbitrary kernel addresses? > >> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c >> index 246d4d4ce5c7..aa0be8cef2d4 100644 >> --- a/kernel/user_namespace.c >> +++ b/kernel/user_namespace.c >> @@ -648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos, >> { >> loff_t pos = *ppos; >> unsigned extents = map->nr_extents; >> - smp_rmb(); >> >> - if (pos >= extents) >> - return NULL; >> + /* paired with smp_wmb in map_write */ >> + smp_rmb(); >> >> - if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) >> - return &map->extent[pos]; >> + if (pos < extents) { >> + osb(); >> + if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) >> + return &map->extent[pos]; >> + return &map->forward[pos]; >> + } >> >> - return &map->forward[pos]; >> + return NULL; >> } >> >> static void *uid_m_start(struct seq_file *seq, loff_t *ppos) > > > >> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c >> index 8ca9915befc8..7f83abdea255 100644 >> --- a/net/mpls/af_mpls.c >> +++ b/net/mpls/af_mpls.c >> @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index) >> if (index < net->mpls.platform_labels) { >> struct mpls_route __rcu **platform_label = >> rcu_dereference(net->mpls.platform_label); >> + >> + osb(); >> rt = rcu_dereference(platform_label[index]); >> } >> return rt; > > Ouch! This adds a barrier in the middle of an rcu lookup, on the > fast path for routing mpls packets. Which if memory serves will > noticably slow down software processing of mpls packets. > > Why does osb() fall after the branch for validity? So that we allow > speculation up until then? It falls there so that the cpu only issues reads with known good 'index' values. 
> I suspect it would be better to have those barriers in the tun/tap > interfaces where userspace can inject packets and thus time them. Then > the code could still speculate and go fast for remote packets. > > Or does the speculation stomping have to be immediately at the place > where we use data from userspace to perform a table lookup? The speculation stomping barrier has to be between where we validate the input and when we may speculate on invalid input. So, yes, moving the user controllable input validation earlier and out of the fast path would be preferred. Think of this patch purely as a static analysis warning that something might need to be done to resolve the report.
On Wed, Jan 3, 2018 at 9:55 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Jan 04, 2018 at 05:50:12AM +0000, Al Viro wrote: >> On Wed, Jan 03, 2018 at 09:44:33PM -0800, Dan Williams wrote: >> > On Wed, Jan 3, 2018 at 8:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> > > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote: >> > > >> > >> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h >> > >> index 1c65817673db..dbc12007da51 100644 >> > >> --- a/include/linux/fdtable.h >> > >> +++ b/include/linux/fdtable.h >> > >> @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i >> > >> { >> > >> struct fdtable *fdt = rcu_dereference_raw(files->fdt); >> > >> >> > >> - if (fd < fdt->max_fds) >> > >> + if (fd < fdt->max_fds) { >> > >> + osb(); >> > >> return rcu_dereference_raw(fdt->fd[fd]); >> > >> + } >> > >> return NULL; >> > >> } >> > > >> > > ... and the point of that would be? Possibly revealing the value of files->fdt? >> > > Why would that be a threat, assuming you manage to extract the information in >> > > question in the first place? >> > >> > No, the concern is that an fd value >= fdt->max_fds may cause the cpu >> > to read arbitrary memory addresses relative to files->fdt and >> > userspace can observe that it got loaded. >> >> Yes. And all that might reveal is the value of files->fdt. Who cares? > > Sorry, s/files->fdt/files->fdt->fd/. Still the same question - what information > would that extract and how would attacker use that? The question is if userspace can ex-filtrate any data from the kernel that would otherwise be blocked by a bounds check should the kernel close that hole? For these patches I do not think the bar should be "can I prove an information leak is exploitable" it should be "can I prove that a leak is not exploitable", especially when possibly combined with other leak sites.
> On Thu, Jan 04, 2018 at 02:15:53AM +0000, Alan Cox wrote: > > > > > > Elena has done the work of auditing static analysis reports to a dozen > > > > or so locations that need some 'nospec' handling. > > > > > > How exactly is that related (especially in longer-term support terms) to > > > BPF anyway? > > > > If you read the papers you need a very specific construct in order to not > > only cause a speculative load of an address you choose but also to then > > manage to cause a second operation that in some way reveals bits of data > > or allows you to ask questions. > > > > BPF allows you to construct those sequences relatively easily and it's > > the one case where a user space application can fairly easily place code > > it wants to execute in the kernel. Without BPF you have to find the right > > construct in the kernel, prime all the right predictions and measure the > > result without getting killed off. There are places you can do that but > > they are not so easy and we don't (at this point) think there are that > > many. > > for BPF in particular we're thinking to do a different fix. > Instead of killing speculation we can let cpu speculate. > The fix will include rounding up bpf maps to nearest power of 2 and > inserting bpf_and operation on the index after bounds check, > so cpu can continue speculate beyond bounds check, but it will > load from zero-ed memory. > So this nospec arch dependent hack won't be necessary for bpf side, > but may still be needed in other parts of the kernel. I think this is a much better approach than what we have been doing internally so far. My initial fix back in August for this was to insert a minimal set of lfence barriers (osb() barrier below) that prevent speculation just based on how attack was using constants to index eBPF maps: diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c #define BPF_R0 regs[BPF_REG_0] @@ -939,6 +940,7 @@ static unsigned int ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, DST = IMM; CONT; LD_IMM_DW: + osb(); DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32; insn++; CONT; @@ -1200,6 +1202,7 @@ static unsigned int ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, *(SIZE *)(unsigned long) (DST + insn->off) = IMM; \ CONT; \ LDX_MEM_##SIZEOP: \ + osb(); \ DST = *(SIZE *)(unsigned long) (SRC + insn->off); \ CONT; And similar stuff also for x86 JIT (blinding dependent): @@ -400,7 +413,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, case BPF_ADD: b2 = 0x01; break; case BPF_SUB: b2 = 0x29; break; case BPF_AND: b2 = 0x21; break; - case BPF_OR: b2 = 0x09; break; + case BPF_OR: b2 = 0x09; emit_memory_barrier(&prog); break; case BPF_XOR: b2 = 0x31; break; } if (BPF_CLASS(insn->code) == BPF_ALU64) @@ -647,6 +660,16 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, case BPF_ALU64 | BPF_RSH | BPF_X: case BPF_ALU64 | BPF_ARSH | BPF_X: + /* If blinding is enabled, each + * BPF_LD | BPF_IMM | BPF_DW instruction + * is converted to 4 eBPF instructions with + * BPF_ALU64_IMM(BPF_LSH, BPF_REG_AX, 32) + * always present(number 3). Detect such cases + * and insert memory barriers. */ + if ((BPF_CLASS(insn->code) == BPF_ALU64) + && (BPF_OP(insn->code) == BPF_LSH) + && (src_reg == BPF_REG_AX)) + emit_memory_barrier(&prog); /* check for bad case when dst_reg == rcx */ if (dst_reg == BPF_REG_4) { /* mov r11, dst_reg */ But this is really ugly fix IMO to prevent speculation from happen, so with your approach I think it is really much better. 
If you need help in testing the fixes, I can do it unless you already have the attack code yourself. > > Also note that variant 1 is talking about exploiting prog_array > bpf feature which had 64-bit access prior to > commit 90caccdd8cc0 ("bpf: fix bpf_tail_call() x64 JIT") > That was a fix for JIT and not related to this cpu issue, but > I believe it breaks the existing exploit. Actually I could not get existing exploit working on anything past 4.11 but at that time I could not see any fundamental change in eBPF that would prevent it. Best Regards, Elena.
Hi Dan,

Thanks for these examples.

On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:
> Note, the following is Elena's work, I'm just helping poke the upstream
> discussion along while she's offline.
>
> Elena audited the static analysis reports down to the following
> locations where userspace provides a value for a conditional branch and
> then later creates a dependent load on that same userspace controlled
> value.
>
> 'osb()', observable memory barrier, resolves to an lfence on x86. My
> concern with these changes is that it is not clear what content within
> the branch block is of concern. Peter's 'if_nospec' proposal combined
> with Mark's 'nospec_load' would seem to clear that up, from a self
> documenting code perspective, and let archs optionally protect entire
> conditional blocks or individual loads within those blocks.

I'm a little concerned by having to use two helpers that need to be
paired. It means that we have to duplicate the bounds information, and I
suspect in practice that they'll often be paired improperly.

I think that we can avoid those problems if we use nospec_ptr() on its
own. It returns NULL if the pointer would be out-of-bounds, so we can use
it in the if-statement. On x86, that can contain the bounds checks
followed by an OSB / lfence, like if_nospec(). On arm we can use our
architected sequence. In either case, nospec_ptr() returns NULL if the
pointer would be out-of-bounds.

Then, rather than sequences like:

	if_nospec(idx < max) {
		val = nospec_array_load(array, idx, max);
		...
	}

... we could have:

	if ((elem_ptr = nospec_array_ptr(array, idx, max))) {
		val = *elem_ptr;
		...
	}

... which also means we don't have to annotate every single load in the
branch if the element is a structure with multiple fields.

Does that sound workable to you? From a quick scan, it looks like it
would fit all of the example cases below.

Thanks,
Mark.

> Note that these are "a human looked at static analysis reports and
> could not rationalize that these are false positives". Specific domain
> knowledge about these paths may find that some of them are indeed false
> positives.
>
> The change to m_start in kernel/user_namespace.c is interesting because
> that's an example where the nospec_load() approach by itself would need
> to barrier speculation twice whereas if_nospec can do it once for the
> whole block.
> > [ forgive any whitespace damage ] > > 8<--- > diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c > index 3e7e283a44a8..65175bbe805f 100644 > --- a/drivers/media/usb/uvc/uvc_v4l2.c > +++ b/drivers/media/usb/uvc/uvc_v4l2.c > @@ -821,6 +821,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh, > } > pin = iterm->id; > } else if (index < selector->bNrInPins) { > + osb(); > pin = selector->baSourceID[index]; > list_for_each_entry(iterm, &chain->entities, chain) { > if (!UVC_ENTITY_IS_ITERM(iterm)) > diff --git a/drivers/net/wireless/ath/carl9170/main.c b/drivers/net/wireless/ath/carl9170/main.c > index 988c8857d78c..cf267b709af6 100644 > --- a/drivers/net/wireless/ath/carl9170/main.c > +++ b/drivers/net/wireless/ath/carl9170/main.c > @@ -1388,6 +1388,7 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw, > > mutex_lock(&ar->mutex); > if (queue < ar->hw->queues) { > + osb(); > memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param)); > ret = carl9170_set_qos(ar); > } else { > diff --git a/drivers/net/wireless/intersil/p54/main.c b/drivers/net/wireless/intersil/p54/main.c > index ab6d39e12069..f024f1ad81ff 100644 > --- a/drivers/net/wireless/intersil/p54/main.c > +++ b/drivers/net/wireless/intersil/p54/main.c > @@ -415,6 +415,7 @@ static int p54_conf_tx(struct ieee80211_hw *dev, > > mutex_lock(&priv->conf_mutex); > if (queue < dev->queues) { > + osb(); > P54_SET_QUEUE(priv->qos_params[queue], params->aifs, > params->cw_min, params->cw_max, params->txop); > ret = p54_set_edcf(priv); > diff --git a/drivers/net/wireless/st/cw1200/sta.c b/drivers/net/wireless/st/cw1200/sta.c > index 38678e9a0562..b4a2a7ba04e8 100644 > --- a/drivers/net/wireless/st/cw1200/sta.c > +++ b/drivers/net/wireless/st/cw1200/sta.c > @@ -619,6 +619,7 @@ int cw1200_conf_tx(struct ieee80211_hw *dev, struct ieee80211_vif *vif, > mutex_lock(&priv->conf_mutex); > > if (queue < dev->queues) { > + osb(); > old_uapsd_flags = le16_to_cpu(priv->uapsd_info.uapsd_flags); > > WSM_TX_QUEUE_SET(&priv->tx_queue_params, queue, 0, 0, 0); > diff --git a/drivers/scsi/qla2xxx/qla_mr.c b/drivers/scsi/qla2xxx/qla_mr.c > index d5da3981cefe..dec8b6e087e3 100644 > --- a/drivers/scsi/qla2xxx/qla_mr.c > +++ b/drivers/scsi/qla2xxx/qla_mr.c > @@ -2304,10 +2304,12 @@ qlafx00_status_entry(scsi_qla_host_t *vha, struct rsp_que *rsp, void *pkt) > req = ha->req_q_map[que]; > > /* Validate handle. */ > - if (handle < req->num_outstanding_cmds) > + if (handle < req->num_outstanding_cmds) { > + osb(); > sp = req->outstanding_cmds[handle]; > - else > + } else { > sp = NULL; > + } > > if (sp == NULL) { > ql_dbg(ql_dbg_io, vha, 0x3034, > @@ -2655,10 +2657,12 @@ qlafx00_multistatus_entry(struct scsi_qla_host *vha, > req = ha->req_q_map[que]; > > /* Validate handle. 
*/ > - if (handle < req->num_outstanding_cmds) > + if (handle < req->num_outstanding_cmds) { > + osb(); > sp = req->outstanding_cmds[handle]; > - else > + } else { > sp = NULL; > + } > > if (sp == NULL) { > ql_dbg(ql_dbg_io, vha, 0x3044, > diff --git a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c > index 145a5c53ff5c..d732b34e120d 100644 > --- a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c > +++ b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c > @@ -57,15 +57,16 @@ static int int340x_thermal_get_trip_temp(struct thermal_zone_device *zone, > if (d->override_ops && d->override_ops->get_trip_temp) > return d->override_ops->get_trip_temp(zone, trip, temp); > > - if (trip < d->aux_trip_nr) > + if (trip < d->aux_trip_nr) { > + osb(); > *temp = d->aux_trips[trip]; > - else if (trip == d->crt_trip_id) > + } else if (trip == d->crt_trip_id) { > *temp = d->crt_temp; > - else if (trip == d->psv_trip_id) > + } else if (trip == d->psv_trip_id) { > *temp = d->psv_temp; > - else if (trip == d->hot_trip_id) > + } else if (trip == d->hot_trip_id) { > *temp = d->hot_temp; > - else { > + } else { > for (i = 0; i < INT340X_THERMAL_MAX_ACT_TRIP_COUNT; i++) { > if (d->act_trips[i].valid && > d->act_trips[i].id == trip) { > diff --git a/fs/udf/misc.c b/fs/udf/misc.c > index 401e64cde1be..c5394760d26b 100644 > --- a/fs/udf/misc.c > +++ b/fs/udf/misc.c > @@ -104,6 +104,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size, > iinfo->i_lenEAttr) { > uint32_t aal = > le32_to_cpu(eahd->appAttrLocation); > + > + osb(); > memmove(&ea[offset - aal + size], > &ea[aal], offset - aal); > offset -= aal; > @@ -114,6 +116,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size, > iinfo->i_lenEAttr) { > uint32_t ial = > le32_to_cpu(eahd->impAttrLocation); > + > + osb(); > memmove(&ea[offset - ial + size], > &ea[ial], offset - ial); > offset -= ial; > @@ -125,6 +129,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size, > iinfo->i_lenEAttr) { > uint32_t aal = > le32_to_cpu(eahd->appAttrLocation); > + > + osb(); > memmove(&ea[offset - aal + size], > &ea[aal], offset - aal); > offset -= aal; > diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h > index 1c65817673db..dbc12007da51 100644 > --- a/include/linux/fdtable.h > +++ b/include/linux/fdtable.h > @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i > { > struct fdtable *fdt = rcu_dereference_raw(files->fdt); > > - if (fd < fdt->max_fds) > + if (fd < fdt->max_fds) { > + osb(); > return rcu_dereference_raw(fdt->fd[fd]); > + } > return NULL; > } > > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 246d4d4ce5c7..aa0be8cef2d4 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos, > { > loff_t pos = *ppos; > unsigned extents = map->nr_extents; > - smp_rmb(); > > - if (pos >= extents) > - return NULL; > + /* paired with smp_wmb in map_write */ > + smp_rmb(); > > - if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > - return &map->extent[pos]; > + if (pos < extents) { > + osb(); > + if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) > + return &map->extent[pos]; > + return &map->forward[pos]; > + } > > - return &map->forward[pos]; > + return NULL; > } > > static void *uid_m_start(struct seq_file *seq, loff_t *ppos) > diff --git a/net/ipv4/raw.c 
b/net/ipv4/raw.c > index 125c1eab3eaa..56909c8fa025 100644 > --- a/net/ipv4/raw.c > +++ b/net/ipv4/raw.c > @@ -476,6 +476,7 @@ static int raw_getfrag(void *from, char *to, int offset, int len, int odd, > if (offset < rfv->hlen) { > int copy = min(rfv->hlen - offset, len); > > + osb(); > if (skb->ip_summed == CHECKSUM_PARTIAL) > memcpy(to, rfv->hdr.c + offset, copy); > else > diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c > index 761a473a07c5..0942784f3f8d 100644 > --- a/net/ipv6/raw.c > +++ b/net/ipv6/raw.c > @@ -729,6 +729,7 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd, > if (offset < rfv->hlen) { > int copy = min(rfv->hlen - offset, len); > > + osb(); > if (skb->ip_summed == CHECKSUM_PARTIAL) > memcpy(to, rfv->c + offset, copy); > else > diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c > index 8ca9915befc8..7f83abdea255 100644 > --- a/net/mpls/af_mpls.c > +++ b/net/mpls/af_mpls.c > @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index) > if (index < net->mpls.platform_labels) { > struct mpls_route __rcu **platform_label = > rcu_dereference(net->mpls.platform_label); > + > + osb(); > rt = rcu_dereference(platform_label[index]); > } > return rt;
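As a concreteness check on the nospec_array_ptr() suggestion above, the __fcheck_files() hunk from the audit might read as follows, assuming the helper behaves as Mark describes (NULL for an out-of-bounds index, otherwise a pointer that is safe to dereference under speculation). This is a sketch against the quoted code, not anything from the posted series:

	static inline struct file *__fcheck_files(struct files_struct *files,
						  unsigned int fd)
	{
		struct fdtable *fdt = rcu_dereference_raw(files->fdt);
		struct file __rcu **fdp;

		fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds);
		if (fdp)
			return rcu_dereference_raw(*fdp);
		return NULL;
	}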
Dan Williams <dan.j.williams@intel.com> writes: > On Wed, Jan 3, 2018 at 9:01 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: >> "Williams, Dan J" <dan.j.williams@intel.com> writes: >> >> >> >>> Note that these are "a human looked at static analysis reports and >>> could not rationalize that these are false positives". Specific domain >>> knowledge about these paths may find that some of them are indeed false >>> positives. >>> >>> The change to m_start in kernel/user_namespace.c is interesting because >>> that's an example where the nospec_load() approach by itself would need >>> to barrier speculation twice whereas if_nospec can do it once for the >>> whole block. >> >> >> This user_namespace.c change is very convoluted for what it is trying to >> do. > > Sorry this was my rebase on top of commit d5e7b3c5f51f "userns: Don't > read extents twice in m_start" the original change from Elena was > simpler. Part of the complexity arises from converting the common > kernel pattern of > > if (<invalid condition>) > return NULL; > do_stuff; > > ...to: > > if (<valid conidtion>) { > barrier(); > do_stuff; > } > >> It simplifies to a one liner that just adds osb() after pos >= >> extents. AKA: >> >> if (pos >= extents) >> return NULL; >> + osb(); >> >> Is the intent to hide which branch branch we take based on extents, >> after the pos check? > > The intent is to prevent speculative execution from triggering any > reads when 'pos' is invalid. If that is the intent I think the patch you posted is woefully inadequate. We have many many more seq files in proc than just /proc/<pid>/uid_map. >> I suspect this implies that using a user namespace and a crafted uid >> map you can hit this in stat, on the fast path. >> >> At which point I suspect we will be better off extending struct >> user_namespace by a few pointers, so there is no union and remove the >> need for blocking speculation entirely. > > How does this help prevent a speculative read with an invalid 'pos' > reading arbitrary kernel addresses? I though the concern was extents. I am now convinced that collectively we need a much better description of the problem than currently exists. Either the patch you presented missed a whole lot like 90%+ of the user/kernel interface or there is some mitigating factor that I am not seeing. Either way until reasonable people can read the code and agree on the potential exploitability of it, I will be nacking these patches. 
>>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c >>> index 246d4d4ce5c7..aa0be8cef2d4 100644 >>> --- a/kernel/user_namespace.c >>> +++ b/kernel/user_namespace.c >>> @@ -648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos, >>> { >>> loff_t pos = *ppos; >>> unsigned extents = map->nr_extents; >>> - smp_rmb(); >>> >>> - if (pos >= extents) >>> - return NULL; >>> + /* paired with smp_wmb in map_write */ >>> + smp_rmb(); >>> >>> - if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) >>> - return &map->extent[pos]; >>> + if (pos < extents) { >>> + osb(); >>> + if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS) >>> + return &map->extent[pos]; >>> + return &map->forward[pos]; >>> + } >>> >>> - return &map->forward[pos]; >>> + return NULL; >>> } >>> >>> static void *uid_m_start(struct seq_file *seq, loff_t *ppos) >> >> >> >>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c >>> index 8ca9915befc8..7f83abdea255 100644 >>> --- a/net/mpls/af_mpls.c >>> +++ b/net/mpls/af_mpls.c >>> @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index) >>> if (index < net->mpls.platform_labels) { >>> struct mpls_route __rcu **platform_label = >>> rcu_dereference(net->mpls.platform_label); >>> + >>> + osb(); >>> rt = rcu_dereference(platform_label[index]); >>> } >>> return rt; >> >> Ouch! This adds a barrier in the middle of an rcu lookup, on the >> fast path for routing mpls packets. Which if memory serves will >> noticably slow down software processing of mpls packets. >> >> Why does osb() fall after the branch for validity? So that we allow >> speculation up until then? > > It falls there so that the cpu only issues reads with known good 'index' values. > >> I suspect it would be better to have those barriers in the tun/tap >> interfaces where userspace can inject packets and thus time them. Then >> the code could still speculate and go fast for remote packets. >> >> Or does the speculation stomping have to be immediately at the place >> where we use data from userspace to perform a table lookup? > > The speculation stomping barrier has to be between where we validate > the input and when we may speculate on invalid input. So a serializing instruction at the kernel/user boundary (like say loading cr3) is not enough? That would seem to break any chance of a controlled timing. > So, yes, moving > the user controllable input validation earlier and out of the fast > path would be preferred. Think of this patch purely as a static > analysis warning that something might need to be done to resolve the > report. That isn't what I was suggesting. I was just suggesting a serialization instruction earlier in the pipeline. Given what I have seen in other parts of the thread I think an and instruction that just limits the index to a sane range is generally applicable, and should be cheap enough to not care about. Further it seems to apply to the pattern the static checkers were catching, so I suspect that is the pattern we want to stress for limiting speculation. Assuming of course the compiler won't just optimize the and of the index out. Eric
On Thu, Jan 04, 2018 at 08:54:11AM -0600, Eric W. Biederman wrote: > Dan Williams <dan.j.williams@intel.com> writes: > > On Wed, Jan 3, 2018 at 9:01 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > >> "Williams, Dan J" <dan.j.williams@intel.com> writes: > Either the patch you presented missed a whole lot like 90%+ of the > user/kernel interface or there is some mitigating factor that I am not > seeing. Either way until reasonable people can read the code and > agree on the potential exploitability of it, I will be nacking these > patches. As Dan mentioned, this is the result of auditing some static analysis reports. I don't think it was claimed that this was complete, just that these are locations that we're fairly certain need attention. Auditing the entire user/kernel interface is going to take time, and I don't think we should ignore this corpus in the mean time (though we certainly want to avoid a whack-a-mole game). [...] > >>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c > >>> index 8ca9915befc8..7f83abdea255 100644 > >>> --- a/net/mpls/af_mpls.c > >>> +++ b/net/mpls/af_mpls.c > >>> @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index) > >>> if (index < net->mpls.platform_labels) { > >>> struct mpls_route __rcu **platform_label = > >>> rcu_dereference(net->mpls.platform_label); > >>> + > >>> + osb(); > >>> rt = rcu_dereference(platform_label[index]); > >>> } > >>> return rt; > >> > >> Ouch! This adds a barrier in the middle of an rcu lookup, on the > >> fast path for routing mpls packets. Which if memory serves will > >> noticably slow down software processing of mpls packets. > >> > >> Why does osb() fall after the branch for validity? So that we allow > >> speculation up until then? > > > > It falls there so that the cpu only issues reads with known good 'index' values. > > > >> I suspect it would be better to have those barriers in the tun/tap > >> interfaces where userspace can inject packets and thus time them. Then > >> the code could still speculate and go fast for remote packets. > >> > >> Or does the speculation stomping have to be immediately at the place > >> where we use data from userspace to perform a table lookup? > > > > The speculation stomping barrier has to be between where we validate > > the input and when we may speculate on invalid input. > > So a serializing instruction at the kernel/user boundary (like say > loading cr3) is not enough? That would seem to break any chance of a > controlled timing. Unfortunately, it isn't sufficient to do this at the kernel/user boundary. Any subsequent bounds check can be mis-speculated regardless of prior serialization. Such serialization has to occur *after* the relevant bounds check, but *before* use of the value that was checked. Where it's possible to audit user-provided values up front, we may be able to batch checks to amortize the cost of such serialization, but typically bounds checks are spread arbitrarily deep in the kernel. [...] > Given what I have seen in other parts of the thread I think an and > instruction that just limits the index to a sane range is generally > applicable, and should be cheap enough to not care about. Where feasible, this sounds good to me. However, since many places have dynamic bounds which aren't necessarily powers-of-two, I'm not sure how applicable this is. Thanks, Mark.
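To make the "and with a constant" idea concrete, here is a minimal sketch of the clamp Eric and Alan describe (table, TABLE_SIZE and read_entry are made-up names; as Mark notes above, this form only works when the bound is a compile-time power of two):

	#define TABLE_SIZE 64			/* hypothetical; must be a power of two */

	static int table[TABLE_SIZE];

	static int read_entry(unsigned int idx)
	{
		if (idx >= TABLE_SIZE)
			return 0;		/* treat out-of-range as empty in this sketch */
		/*
		 * Even if the branch above is mispredicted, the AND below keeps
		 * any speculative load inside table[].  The mask is a constant,
		 * so there is nothing for the CPU to predict, which is why this
		 * pattern needs no barrier.
		 */
		idx &= TABLE_SIZE - 1;
		return table[idx];
	}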
On Wed, Jan 3, 2018 at 10:28 PM, Julia Lawall <julia.lawall@lip6.fr> wrote: > > > On Wed, 3 Jan 2018, Dan Williams wrote: > >> [ adding Julia and Dan ] >> >> On Wed, Jan 3, 2018 at 5:07 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote: >> > On Wed, 3 Jan 2018 16:39:31 -0800 >> > Linus Torvalds <torvalds@linux-foundation.org> wrote: >> > >> >> On Wed, Jan 3, 2018 at 4:15 PM, Dan Williams <dan.j.williams@intel.com> wrote: >> >> > The 'if_nospec' primitive marks locations where the kernel is disabling >> >> > speculative execution that could potentially access privileged data. It >> >> > is expected to be paired with a 'nospec_{ptr,load}' where the user >> >> > controlled value is actually consumed. >> >> >> >> I'm much less worried about these "nospec_load/if" macros, than I am >> >> about having a sane way to determine when they should be needed. >> >> >> >> Is there such a sane model right now, or are we talking "people will >> >> randomly add these based on strong feelings"? >> > >> > There are people trying to tune coverity and other tool rules to identify >> > cases, and some of the work so far was done that way. For x86 we didn't >> > find too many so far so either the needed pattern is uncommon or .... 8) >> > >> > Given you can execute over a hundred basic instructions in a speculation >> > window it does need to be a tool that can explore not just in function >> > but across functions. That's really tough for the compiler itself to do >> > without help. >> > >> > What remains to be seen is if there are other patterns that affect >> > different processors. >> > >> > In the longer term the compiler itself needs to know what is and isn't >> > safe (ie you need to be able to write things like >> > >> > void foo(tainted __user int *x) >> > >> > and have the compiler figure out what level of speculation it can do and >> > (on processors with those features like IA64) when it can and can't do >> > various kinds of non-trapping loads. >> > >> >> It would be great if coccinelle and/or smatch could be taught to catch >> some of these case at least as a first pass "please audit this code >> block" type of notification. >> > > What should one be looking for. Do you have a typical example? > See "Exploiting Conditional Branch Misprediction" from the paper [1]. The typical example is an attacker controlled index used to trigger a dependent read near a branch. Where an example of "near" from the paper is "up to 188 simple instructions inserted in the source code between the ‘if’ statement and the line accessing array...". if (attacker_controlled_index < bound) val = array[attacker_controlled_index]; else return error; ...when the cpu speculates that the 'index < bound' branch is taken it reads index and uses that value to read array[index]. The result of an 'array' relative read is potentially observable in the cache. [1]: https://spectreattack.com/spectre.pdf
Hi! > >> > What remains to be seen is if there are other patterns that affect > >> > different processors. > >> > > >> > In the longer term the compiler itself needs to know what is and isn't > >> > safe (ie you need to be able to write things like > >> > > >> > void foo(tainted __user int *x) > >> > > >> > and have the compiler figure out what level of speculation it can do and > >> > (on processors with those features like IA64) when it can and can't do > >> > various kinds of non-trapping loads. > >> > > >> > >> It would be great if coccinelle and/or smatch could be taught to catch > >> some of these case at least as a first pass "please audit this code > >> block" type of notification. > >> > > > > What should one be looking for. Do you have a typical example? > > > > See "Exploiting Conditional Branch Misprediction" from the paper [1]. > > The typical example is an attacker controlled index used to trigger a > dependent read near a branch. Where an example of "near" from the > paper is "up to 188 simple instructions inserted in the source code > between the ‘if’ statement and the line accessing array...". > > if (attacker_controlled_index < bound) > val = array[attacker_controlled_index]; > else > return error; > > ...when the cpu speculates that the 'index < bound' branch is taken it > reads index and uses that value to read array[index]. The result of an > 'array' relative read is potentially observable in the cache. You still need (void) array2[val]; after that to get something observable, right? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Thu 2018-01-04 02:47:51, Jiri Kosina wrote: > On Thu, 4 Jan 2018, Alan Cox wrote: > > > > If the CPU speculation can cause these kinds of side-effects, it just must > > > not happen, full stop. > > > > At which point your performance will resemble that of a 2012 atom > > processor at best. > > You know what? I'd be completely fine with that, if it's traded for "my > ssh and internet banking keys are JUST MINE, ok?" :) Agreed. For the kernel, we may be able to annotate "tainted" pointers. But then there's quite a lot of code in userspace... What will need to be modified? Just JITs? Setuid programs? And we can get part of the performance back by adding more SMT... AFAICT. Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> For the kernel, we may be able to annotate "tainted" pointers. But then > there's quite a lot of code in userspace... What will need to be > modified? Just JITs? Setuid programs? You never go from one user process to another except via the kernel. We have no hardware scheduling going on. That means that if the kernel and/or CPU imposes the correct speculation barriers you can't attack anyone but yourself. Sandboxes and JITs are the critical components where the issue matters. That IMHO is also an argument for adding, once the basics are in place, a prctl and maybe cgroup support to run some groups of processes with different trust levels. For example there's not much point protecting a user process from root, or a process from another process that could use ptrace on it instead. > And we can get part of the performance back by adding more > SMT... AFAICT. The current evidence suggests not, because most workloads are not sufficiently parallelisable. Likewise, doing the speculation in the compiler doesn't appear to have been the success people hoped for (IA64). Alan
On Thu, 4 Jan 2018, Alan Cox wrote: > You never go from one user process to another except via the kernel. We > have no hardware scheduling going on. That means that if the kernel > and/or CPU imposes the correct speculation barriers you can't attack > anyone but yourself. So how does this work on HT with the shared BTB? There is no context switch (and hence no IBPB) happening between the threads sharing it. -- Jiri Kosina SUSE Labs
Hi! > > So this is in that same category, but yes, it's inconvenient. > > Disagreed, violently. CPU has to execute the instructions I ask it to > execute, and if it executes *anything* else that reveals any information > about the instructions that have *not* been executed, it's flawed. I agree that's a flaw. Unfortunately... CPUs do execute instructions you did not ask them to execute all the time. Plus CPU designers forgot that cache state (and the active row in DRAM) is actually an observable side-effect. ....and that's where we are today. If you want, I have two systems with AMD Geode. One is a PC. Neither is very fast. All the other general purpose CPUs I have -- and that includes smartphones -- are likely out-of-order, and that means flawed. So... the situation is bad. CPUs do execute instructions you did not ask them to execute. I don't think that's reasonable to change. I believe the "right" fix would be for CPUs to treat even DRAM reads as side-effects, and adjust speculation accordingly. I'm not sure Intel/AMD is going to do the right thing here. Oh, I have an FPGA, too, if you want to play with RISC-V :-). Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Hi! > > It falls there so that the cpu only issues reads with known good 'index' values. > > > >> I suspect it would be better to have those barriers in the tun/tap > >> interfaces where userspace can inject packets and thus time them. Then > >> the code could still speculate and go fast for remote packets. > >> > >> Or does the speculation stomping have to be immediately at the place > >> where we use data from userspace to perform a table lookup? > > > > The speculation stomping barrier has to be between where we validate > > the input and when we may speculate on invalid input. > > So a serializing instruction at the kernel/user boundary (like say > loading cr3) is not enough? That would seem to break any chance of a > controlled timing. A timing attack is not as straightforward as this. You can assume cache snooping from a second CPU _while_ the kernel is executing. Yes, that will mean timing, but.... Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Thu, 4 Jan 2018 21:39:24 +0100 (CET) Jiri Kosina <jikos@kernel.org> wrote: > On Thu, 4 Jan 2018, Alan Cox wrote: > > > You never go from one user process to another except via the kernel. We > > have no hardware scheduling going on. That means that if the kernel > > and/or CPU imposes the correct speculation barriers you can't attack > > anyone but yourself. > > So how does this work on HT with the shared BTB? There is no context > switch (and hence no IBPB) happening between the threads sharing it. > If you are paranoid in that case you either need to schedule things that trust each other together or disable the speculation while that situation occurs. However the kernel is always in the position to make that decision. Alan
On Thu, Jan 4, 2018 at 11:26 AM, Pavel Machek <pavel@ucw.cz> wrote: > Hi! > >> >> > What remains to be seen is if there are other patterns that affect >> >> > different processors. >> >> > >> >> > In the longer term the compiler itself needs to know what is and isn't >> >> > safe (ie you need to be able to write things like >> >> > >> >> > void foo(tainted __user int *x) >> >> > >> >> > and have the compiler figure out what level of speculation it can do and >> >> > (on processors with those features like IA64) when it can and can't do >> >> > various kinds of non-trapping loads. >> >> > >> >> >> >> It would be great if coccinelle and/or smatch could be taught to catch >> >> some of these case at least as a first pass "please audit this code >> >> block" type of notification. >> >> >> > >> > What should one be looking for. Do you have a typical example? >> > >> >> See "Exploiting Conditional Branch Misprediction" from the paper [1]. >> >> The typical example is an attacker controlled index used to trigger a >> dependent read near a branch. Where an example of "near" from the >> paper is "up to 188 simple instructions inserted in the source code >> between the ‘if’ statement and the line accessing array...". >> >> if (attacker_controlled_index < bound) >> val = array[attacker_controlled_index]; >> else >> return error; >> >> ...when the cpu speculates that the 'index < bound' branch is taken it >> reads index and uses that value to read array[index]. The result of an >> 'array' relative read is potentially observable in the cache. > > You still need > > (void) array2[val]; > > after that to get something observable, right? As far as I understand the presence of array2[val] discloses more information, but in terms of the cpu taking an action that it is observable in the cache that's already occurred when "val = array[attacker_controlled_index];" is speculated. Lets err on the side of caution and shut down all the observable actions that are already explicitly gated by an input validation check. In other words, a low bandwidth information leak is still a leak.
On Thu 2018-01-04 21:23:59, Alan Cox wrote: > On Thu, 4 Jan 2018 21:39:24 +0100 (CET) > Jiri Kosina <jikos@kernel.org> wrote: > > > On Thu, 4 Jan 2018, Alan Cox wrote: > > > > > You never go from one user process to another except via the kernel. We > > > have no hardware scheduling going on. That means that if the kernel > > > and/or CPU imposes the correct speculation barriers you can't attack > > > anyone but yourself. > > > > So how does this work on HT with the shared BTB? There is no context > > switch (and hence no IBPB) happening between the threads sharing it. > > > > If you are paranoid in that case you either need to schedule things that > trust each other together or disable the speculation while that situation > occurs. However the kernel is always in the position to make that > decision. Actually... I'm not paranoid but would like to run flightgear on one core (smt cpu #0), with smt cpu#1 being idle, while running compilations on the second core (smt cpus #2 and #3). Is there an easy way to do that? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Thu, Jan 4, 2018 at 3:47 AM, Mark Rutland <mark.rutland@arm.com> wrote: > Hi Dan, > > Thanks for these examples. > > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote: >> Note, the following is Elena's work, I'm just helping poke the upstream >> discussion along while she's offline. >> >> Elena audited the static analysis reports down to the following >> locations where userspace provides a value for a conditional branch and >> then later creates a dependent load on that same userspace controlled >> value. >> >> 'osb()', observable memory barrier, resolves to an lfence on x86. My >> concern with these changes is that it is not clear what content within >> the branch block is of concern. Peter's 'if_nospec' proposal combined >> with Mark's 'nospec_load' would seem to clear that up, from a self >> documenting code perspective, and let archs optionally protect entire >> conditional blocks or individual loads within those blocks. > > I'm a little concerned by having to use two helpers that need to be paired. It > means that we have to duplicate the bounds information, and I suspect in > practice that they'll often be paired improperly. Hmm, will they be often mispaired? All of the examples to date the load occurs in the same code block as the bound checking if() statement. > I think that we can avoid those problems if we use nospec_ptr() on its own. It > returns NULL if the pointer would be out-of-bounds, so we can use it in the > if-statement. > > On x86, that can contain the bounds checks followed be an OSB / lfence, like > if_nospec(). On arm we can use our architected sequence. In either case, > nospec_ptr() returns NULL if the pointer would be out-of-bounds. > > Then, rather than sequences like: > > if_nospec(idx < max) { > val = nospec_array_load(array, idx, max); > ... > } > > ... we could have: > > if ((elem_ptr = nospec_array_ptr(array, idx, max)) { > val = *elem_ptr; > ... > } > > ... which also means we don't have to annotate every single load in the branch > if the element is a structure with multiple fields. We wouldn't need to annotate each load in that case, right? Just the instance that uses idx to calculate an address. if_nospec (idx < max_idx) { elem_ptr = nospec_array_ptr(array, idx, max); val = elem_ptr->val; val2 = elem_ptr->val2; } > Does that sound workable to you? Relative to the urgency of getting fixes upstream I'm ok with whatever approach generates the least debate from sub-system maintainers. One concern is that on x86 the: if ((elem_ptr = nospec_array_ptr(array, idx, max)) { ...approach produces two conditional branches whereas: if_nospec (idx < max_idx) { elem_ptr = nospec_array_ptr(array, idx, max); ....can be optimized to one conditional with the barrier. Is that convincing enough to proceed with if_nospec + nospec_* combination?
On Thu, Jan 4, 2018 at 1:43 PM, Dan Williams <dan.j.williams@intel.com> wrote: > > As far as I understand the presence of array2[val] discloses more > information, but in terms of the cpu taking an action that it is > observable in the cache that's already occurred when "val = > array[attacker_controlled_index];" is speculated. Lets err on the side > of caution and shut down all the observable actions that are already > explicitly gated by an input validation check. In other words, a low > bandwidth information leak is still a leak. I do think that the change to __fcheck_files() is interesting, because that one is potentially rather performance-sensitive. And we presumably won't be able to get rid of the data access speculation annotations even with newer hardware, so to some degree this is a bigger deal than the random hacks that affect _current_ hardware but that hw designers will hopefully fix in the future. That does make me think that we should strive to perhaps use the same model that the BPF people are looking at: making the address generation part of the computation, rather than using a barrier. I'm not sure how expensive lfence is, but from every single time ever in the history of man, barriers have been a bad idea. Which makes me suspect that lfence is a bad idea too. If one common pattern is going to be bounds-checking array accesses like this, then we probably should strive to turn if (index < bound) val = array[index]; into making sure we actually have that data dependency on the address instead. Whether that is done with a "select" instruction or by using an "and" with -1/0 is then a small detail. I wonder if we can make the interface be something really straightforward: - specify a special array_access(array, index, max) operation, and say that the access is guaranteed to not speculatively access outside the array, and return 0 (of the type "array entry") if outside the range. - further specify that it's safe as READ_ONCE() and/or RCU access (and only defined for native data types) That would turn that existing __fcheck_files() from if (fd < fdt->max_fds) return rcu_dereference_raw(fdt->fd[fd]); return NULL; into just return array_access(fdt->fd, fd, fdt->max_fds); and the *implementation* might be architecture-specific, but one particular implementation would be something like simply #define array_access(base, idx, max) ({ \ union { typeof(base[0]) _val; unsigned long _bit; } __u;\ unsigned long _i = (idx); \ unsigned long _m = (max); \ unsigned long _mask = _i < _m ? ~0 : 0; \ OPTIMIZER_HIDE_VAR(_mask); \ __u._val = base[_i & _mask]; \ __u._bit &= _mask; \ __u._val; }) (Ok, the above is not exhaustively tested, but you get the idea, and gcc doesn't seem to mess it up *too* badly). No barriers anywhere, except for the compiler barrier to make sure the compiler doesn't see how it all depends on that comparison, and actually uses two "and" instructions to fix the address and the data. How slow is lfence? Maybe it's not slow, but the above looks like it *might* be better.. Linus
On Thu, Jan 4, 2018 at 2:20 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > #define array_access(base, idx, max) ({ \ > union { typeof(base[0]) _val; unsigned long _bit; } __u;\ > unsigned long _i = (idx); \ > unsigned long _m = (max); \ > unsigned long _mask = _i < _m ? ~0 : 0; \ > OPTIMIZER_HIDE_VAR(_mask); \ > __u._val = base[_i & _mask]; \ > __u._bit &= _mask; \ > __u._val; }) That __u._val = base[_i & _mask]; thing would have to be READ_ONCE() to make it all ok for the special cases without locking etc. Linus
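Folding that correction in, the sketch becomes something like the following; this is just the array_access() example above with READ_ONCE() added, assuming the kernel's READ_ONCE() and OPTIMIZER_HIDE_VAR() from linux/compiler.h, and it is a sketch rather than a final interface:

	#include <linux/compiler.h>

	/* Sketch only: the array_access() example above, with the READ_ONCE() fix. */
	#define array_access(base, idx, max) ({				\
		union { typeof((base)[0]) _val; unsigned long _bit; } __u;	\
		unsigned long _i = (idx);					\
		unsigned long _m = (max);					\
		unsigned long _mask = _i < _m ? ~0UL : 0;			\
		OPTIMIZER_HIDE_VAR(_mask);					\
		__u._val = READ_ONCE((base)[_i & _mask]);			\
		__u._bit &= _mask;						\
		__u._val;							\
	})

With that, the __fcheck_files() case quoted earlier would still read as a single line: return array_access(fdt->fd, fd, fdt->max_fds);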
Hi! > >> > What should one be looking for. Do you have a typical example? > >> > > >> > >> See "Exploiting Conditional Branch Misprediction" from the paper [1]. > >> > >> The typical example is an attacker controlled index used to trigger a > >> dependent read near a branch. Where an example of "near" from the > >> paper is "up to 188 simple instructions inserted in the source code > >> between the ‘if’ statement and the line accessing array...". > >> > >> if (attacker_controlled_index < bound) > >> val = array[attacker_controlled_index]; > >> else > >> return error; > >> > >> ...when the cpu speculates that the 'index < bound' branch is taken it > >> reads index and uses that value to read array[index]. The result of an > >> 'array' relative read is potentially observable in the cache. > > > > You still need > > > > (void) array2[val]; > > > > after that to get something observable, right? > > As far as I understand the presence of array2[val] discloses more > information, but in terms of the cpu taking an action that it is > observable in the cache that's already occurred when "val = > array[attacker_controlled_index];" is speculated. Lets err on the Well yes, attacker can observe val = array[attacker_controlled_index]; . But that's not something he's interested in. So the CPU cheats and attacker has a proof. But he knew that before. >side > of caution and shut down all the observable actions that are already > explicitly gated by an input validation check. In other words, a low > bandwidth information leak is still a leak. What did it leak? Nothing. Attacker had to know array+attacker_controlled_index, and he now knows (array+attacker_controlled_index)%CACHELINE_SIZE. With (void) array2[val];, the attack gets interesting -- I now know *(array+attacker_controlled_index) % CACHELINE_SIZE ... allowing me to get information from arbitrary place in memory -- which is useful for .. reading ssh keys, for example. Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
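Putting the two halves of that exchange together, the overall gadget shape being discussed looks roughly like this; it is purely illustrative, and array1, array2, victim, sink and the 64-byte cache-line stride are assumptions rather than code from any of the patches:

	unsigned long array1_size = 16;
	unsigned char array1[16];
	unsigned char array2[256 * 64];		/* one cache line per possible byte value */
	volatile unsigned char sink;		/* keeps the probe load from being optimized away */

	void victim(unsigned long attacker_controlled_index)
	{
		unsigned char val;

		if (attacker_controlled_index < array1_size) {
			/* speculatively out-of-bounds when the branch is mispredicted */
			val = array1[attacker_controlled_index];
			/*
			 * The value-dependent second load is what Pavel is pointing
			 * at: which of the 256 lines of array2 ends up cached encodes
			 * 'val', so the attacker can recover it by timing later reads.
			 */
			sink &= array2[val * 64];
		}
	}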
> and the *implementation* might be architecture-specific, but one > particular implementation would be something like simply > > #define array_access(base, idx, max) ({ \ > union { typeof(base[0]) _val; unsigned long _bit; } __u;\ > unsigned long _i = (idx); \ > unsigned long _m = (max); \ > unsigned long _mask = _i < _m ? ~0 : 0; \ > OPTIMIZER_HIDE_VAR(_mask); \ > __u._val = base[_i & _mask]; \ > __u._bit &= _mask; \ > __u._val; }) > > (Ok, the above is not exhaustively tested, but you get the idea, and > gcc doesn't seem to mess it up *too* badly). How do you ensure that the CPU doesn't speculate _i < _m ? ~0 : 0, pick the wrong mask, and then reference base[]? It's a nice idea but I think we'd need to run it past the various CPU vendors before it's implemented as a generic solution. Anding with a constant works because neither the constant nor the and with a constant gets speculated, but you've got a whole additional conditional path in your macro. Alan
On Thu, Jan 4, 2018 at 2:55 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote: > > How do you ensure that the CPU doesn't speculate j < _m ? ~0 : 0 pick the > wrong mask and then reference base[] ? .. yeah, that's exactly where we want to make sure that the compiler uses a select or 'setb'. That's what gcc does for me in testing: xorl %eax, %eax setbe %al negq %rax but yes, we'd need to guarantee it somehow. Presumably that is where we end up having some arch-specific stuff. Possibly there is some gcc builtin. I wanted to avoid actually writing architecture-specific asm. > Anding with a constant works because the constant doesn't get speculated > and nor does the and with a constant, but you've got a whole additional > conditional path in your macro. Absolutely. Think of it as an example, not "the solution". It's also possible that x86 'lfence' really is so fast that it doesn't make sense to try to do this. Agner Fog claims that it's single-cycle (well, except for P4, surprise, surprise), but I suspect that his timings are simply for 'lfence' in a loop or something. Which may not show the real cost of actually halting things until they are stable. Also, maybe that __fcheck_files() pattern where getting a NULL pointer happens to be the right thing for out-of-range is so unusual as to be useless, and most people end up having to have that limit check for other reasons anyway. Linus
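For what it's worth, one shape the arch-specific piece could take on x86 is to derive the mask from the comparison itself with a cmp/sbb pair, so the mask comes from a data dependency on the flags rather than from anything the branch predictor can guess. nospec_mask() below is a made-up name and only a sketch of the idea, not something proposed in this thread:

	/*
	 * Sketch: return ~0UL when idx < max, 0 otherwise, with no conditional
	 * branch.  CMP sets CF when idx < max (unsigned), and SBB of a register
	 * with itself turns CF into an all-ones or all-zeroes mask.
	 */
	static inline unsigned long nospec_mask(unsigned long idx, unsigned long max)
	{
		unsigned long mask;

		asm volatile("cmp %1, %2; sbb %0, %0"
			     : "=r" (mask)
			     : "r" (max), "r" (idx)
			     : "cc");
		return mask;
	}

The caller would then do something like idx &= nospec_mask(idx, max); before the load, replacing the _i < _m ? ~0 : 0 ternary so the result no longer depends on how the compiler chooses to lower a conditional.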
> It's also possible that x86 'lfence' really is so fast that it doesn't > make sense to try to do this. If you've got a lot going on it's not 1 cycle. Alan
On Thu, Jan 4, 2018 at 2:44 PM, Pavel Machek <pavel@ucw.cz> wrote: > Hi! > >> >> > What should one be looking for. Do you have a typical example? >> >> > >> >> >> >> See "Exploiting Conditional Branch Misprediction" from the paper [1]. >> >> >> >> The typical example is an attacker controlled index used to trigger a >> >> dependent read near a branch. Where an example of "near" from the >> >> paper is "up to 188 simple instructions inserted in the source code >> >> between the ‘if’ statement and the line accessing array...". >> >> >> >> if (attacker_controlled_index < bound) >> >> val = array[attacker_controlled_index]; >> >> else >> >> return error; >> >> >> >> ...when the cpu speculates that the 'index < bound' branch is taken it >> >> reads index and uses that value to read array[index]. The result of an >> >> 'array' relative read is potentially observable in the cache. >> > >> > You still need >> > >> > (void) array2[val]; >> > >> > after that to get something observable, right? >> >> As far as I understand the presence of array2[val] discloses more >> information, but in terms of the cpu taking an action that it is >> observable in the cache that's already occurred when "val = >> array[attacker_controlled_index];" is speculated. Lets err on the > > Well yes, attacker can observe val = > array[attacker_controlled_index]; . But that's not something he's > interested in. So the CPU cheats and attacker has a proof. But he knew > that before. > >>side >> of caution and shut down all the observable actions that are already >> explicitly gated by an input validation check. In other words, a low >> bandwidth information leak is still a leak. > > What did it leak? Nothing. Attacker had to know > array+attacker_controlled_index, and he now knows > (array+attacker_controlled_index)%CACHELINE_SIZE. > > With (void) array2[val];, the attack gets interesting -- I now know > *(array+attacker_controlled_index) % CACHELINE_SIZE ... allowing me to > get information from arbitrary place in memory -- which is useful for > .. reading ssh keys, for example. Right, but how far away from "val = array[attacker_controlled_index];" in the instruction stream do you need to look before you're comfortable there's no 'val' dependent reads in the speculation window on all possible architectures. Until we have variable annotations and compiler help my guess is that static analysis has an easier time pointing us to the first potentially bad speculative access.
> Right, but how far away from "val = array[attacker_controlled_index];" > in the instruction stream do you need to look before you're > comfortable there's no 'val' dependent reads in the speculation window > on all possible architectures. Until we have variable annotations and > compiler help my guess is that static analysis has an easier time > pointing us to the first potentially bad speculative access. On x86 the researchers are saying up to 180 or so simple instructions. I've not seen an official value quoted by Intel or AMD. It's enough that you need to peer across functions. Alan
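As a concrete illustration of why a per-function view is not enough, the bounds check and the dependent load can sit in different functions entirely; this reuses the hypothetical table/TABLE_SIZE names from the sketch earlier in the thread:

	/* The dependent load lives here... */
	static int lookup(unsigned long idx)
	{
		return table[idx];
	}

	/* ...while the bounds check lives in the caller. */
	int handle_request(unsigned long user_idx)
	{
		if (user_idx >= TABLE_SIZE)
			return 0;
		/*
		 * A mispredicted branch above can still speculate into lookup(),
		 * so a checker that only looks within single functions misses
		 * this pattern.
		 */
		return lookup(user_idx);
	}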
Hi! > > What did it leak? Nothing. Attacker had to know > > array+attacker_controlled_index, and he now knows > > (array+attacker_controlled_index)%CACHELINE_SIZE. > > > > With (void) array2[val];, the attack gets interesting -- I now know > > *(array+attacker_controlled_index) % CACHELINE_SIZE ... allowing me to > > get information from arbitrary place in memory -- which is useful for > > .. reading ssh keys, for example. > > Right, but how far away from "val = array[attacker_controlled_index];" > in the instruction stream do you need to look before you're > comfortable there's no 'val' dependent reads in the speculation window > on all possible architectures. Until we have variable annotations and > compiler help my guess is that static analysis has an easier time > pointing us to the first potentially bad speculative access. Well, you are already scanning for if (attacker_controlled_index < limit) .... array[attacker_controlled_index], and those can already be far away from each other.... Anyway, likely in the end a human should be creating the patch, and if there's no array2[val], we do not need the patch after all. Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Thu, Jan 4, 2018 at 3:06 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jan 4, 2018 at 2:55 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote: >> >> How do you ensure that the CPU doesn't speculate j < _m ? ~0 : 0 pick the >> wrong mask and then reference base[] ? > > .. yeah, that's exactly where we want to make sure that the compiler > uses a select or 'setb'. > > That's what gcc does for me in testing: > > xorl %eax, %eax > setbe %al > negq %rax > > but yes, we'd need to guarantee it somehow. > > Presumably that is where we end up having some arch-specific stuff. > Possibly there is some gcc builtin. I wanted to avoid actually writing > architecture-specific asm. > >> Anding with a constant works because the constant doesn't get speculated >> and nor does the and with a constant, but you've got a whole additional >> conditional path in your macro. > > Absolutely. Think of it as an example, not "the solution". > > It's also possible that x86 'lfence' really is so fast that it doesn't > make sense to try to do this. Agner Fog claims that it's single-cycle > (well, except for P4, surprise, surprise), but I suspect that his > timings are simply for 'lfence' in a loop or something. Which may not > show the real cost of actually halting things until they are stable. > > Also, maybe that __fcheck_files() pattern where getting a NULL pointer > happens to be the right thing for out-of-range is so unusual as to be > useless, and most people end up having to have that limit check for > other reasons anyway. This potential barrier avoidance optimization technique is something that could fit in the nospec_{ptr,load,array_load} interface that Mark defined for ARM, and is constructed around a proposed compiler builtin. Although, lfence is not a full serializing instruction, so before we spend too much effort trying to kill it we should measure how bad it is in practice in these paths. At this point I'll go ahead with rewriting the osb() patches into using Mark's nospec_* accessors plus the if_nospec macro. We can kill the barrier in if_nospec once we are sure the compiler will always "Do the Right Thing" with the array_access() approach, and can otherwise make the array_access() approach the 'best-effort' default fallback.
It looks like the problem in terms of detection is to find values that should be annotated as __user. Poking around a bit, it seems like this tool is doing just that: http://www.cs.umd.edu/~jfoster/cqual/ It dates from 2004, but perhaps the developer could be motivated to pick it up again. I don't think Coccinelle would be good for doing this (ie implementing taint analysis) because the dataflow is too complicated. julia
On Thu, Jan 04, 2018 at 02:09:52PM -0800, Dan Williams wrote: > On Thu, Jan 4, 2018 at 3:47 AM, Mark Rutland <mark.rutland@arm.com> wrote: > > Hi Dan, > > > > Thanks for these examples. > > > > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote: > >> Note, the following is Elena's work, I'm just helping poke the upstream > >> discussion along while she's offline. > >> > >> Elena audited the static analysis reports down to the following > >> locations where userspace provides a value for a conditional branch and > >> then later creates a dependent load on that same userspace controlled > >> value. > >> > >> 'osb()', observable memory barrier, resolves to an lfence on x86. My > >> concern with these changes is that it is not clear what content within > >> the branch block is of concern. Peter's 'if_nospec' proposal combined > >> with Mark's 'nospec_load' would seem to clear that up, from a self > >> documenting code perspective, and let archs optionally protect entire > >> conditional blocks or individual loads within those blocks. > > > > I'm a little concerned by having to use two helpers that need to be paired. It > > means that we have to duplicate the bounds information, and I suspect in > > practice that they'll often be paired improperly. > > Hmm, will they be often mispaired? All of the examples to date the > load occurs in the same code block as the bound checking if() > statement. Practically speaking, barriers get misused all the time, and having a single helper minimizes the risk or misuse. > > I think that we can avoid those problems if we use nospec_ptr() on its own. It > > returns NULL if the pointer would be out-of-bounds, so we can use it in the > > if-statement. > > > > On x86, that can contain the bounds checks followed be an OSB / lfence, like > > if_nospec(). On arm we can use our architected sequence. In either case, > > nospec_ptr() returns NULL if the pointer would be out-of-bounds. > > > > Then, rather than sequences like: > > > > if_nospec(idx < max) { > > val = nospec_array_load(array, idx, max); > > ... > > } > > > > ... we could have: > > > > if ((elem_ptr = nospec_array_ptr(array, idx, max)) { > > val = *elem_ptr; > > ... > > } > > > > ... which also means we don't have to annotate every single load in the branch > > if the element is a structure with multiple fields. > > We wouldn't need to annotate each load in that case, right? Just the > instance that uses idx to calculate an address. Correct. > if_nospec (idx < max_idx) { > elem_ptr = nospec_array_ptr(array, idx, max); > val = elem_ptr->val; > val2 = elem_ptr->val2; > } > > > Does that sound workable to you? > > Relative to the urgency of getting fixes upstream I'm ok with whatever > approach generates the least debate from sub-system maintainers. > > One concern is that on x86 the: > > if ((elem_ptr = nospec_array_ptr(array, idx, max)) { > > ...approach produces two conditional branches whereas: > > if_nospec (idx < max_idx) { > elem_ptr = nospec_array_ptr(array, idx, max); > > ....can be optimized to one conditional with the barrier. Do you mean because the NULL check is redundant? I was hoping that the compiler would have the necessary visibility to fold that with the bounds check (on the assumption that the array base must not be NULL). Thanks, Mark.
On Fri, Jan 5, 2018 at 6:40 AM, Mark Rutland <mark.rutland@arm.com> wrote: > On Thu, Jan 04, 2018 at 02:09:52PM -0800, Dan Williams wrote: >> On Thu, Jan 4, 2018 at 3:47 AM, Mark Rutland <mark.rutland@arm.com> wrote: >> > Hi Dan, >> > >> > Thanks for these examples. >> > >> > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote: >> >> Note, the following is Elena's work, I'm just helping poke the upstream >> >> discussion along while she's offline. >> >> >> >> Elena audited the static analysis reports down to the following >> >> locations where userspace provides a value for a conditional branch and >> >> then later creates a dependent load on that same userspace controlled >> >> value. >> >> >> >> 'osb()', observable memory barrier, resolves to an lfence on x86. My >> >> concern with these changes is that it is not clear what content within >> >> the branch block is of concern. Peter's 'if_nospec' proposal combined >> >> with Mark's 'nospec_load' would seem to clear that up, from a self >> >> documenting code perspective, and let archs optionally protect entire >> >> conditional blocks or individual loads within those blocks. >> > >> > I'm a little concerned by having to use two helpers that need to be paired. It >> > means that we have to duplicate the bounds information, and I suspect in >> > practice that they'll often be paired improperly. >> >> Hmm, will they be often mispaired? All of the examples to date the >> load occurs in the same code block as the bound checking if() >> statement. > > Practically speaking, barriers get misused all the time, and having a > single helper minimizes the risk or misuse. I agree, but this is why if_nospec hides the barrier / makes it implicit. > >> > I think that we can avoid those problems if we use nospec_ptr() on its own. It >> > returns NULL if the pointer would be out-of-bounds, so we can use it in the >> > if-statement. >> > >> > On x86, that can contain the bounds checks followed be an OSB / lfence, like >> > if_nospec(). On arm we can use our architected sequence. In either case, >> > nospec_ptr() returns NULL if the pointer would be out-of-bounds. >> > >> > Then, rather than sequences like: >> > >> > if_nospec(idx < max) { >> > val = nospec_array_load(array, idx, max); >> > ... >> > } >> > >> > ... we could have: >> > >> > if ((elem_ptr = nospec_array_ptr(array, idx, max)) { >> > val = *elem_ptr; >> > ... >> > } >> > >> > ... which also means we don't have to annotate every single load in the branch >> > if the element is a structure with multiple fields. >> >> We wouldn't need to annotate each load in that case, right? Just the >> instance that uses idx to calculate an address. > > Correct. > >> if_nospec (idx < max_idx) { >> elem_ptr = nospec_array_ptr(array, idx, max); >> val = elem_ptr->val; >> val2 = elem_ptr->val2; >> } >> >> > Does that sound workable to you? >> >> Relative to the urgency of getting fixes upstream I'm ok with whatever >> approach generates the least debate from sub-system maintainers. >> >> One concern is that on x86 the: >> >> if ((elem_ptr = nospec_array_ptr(array, idx, max)) { >> >> ...approach produces two conditional branches whereas: >> >> if_nospec (idx < max_idx) { >> elem_ptr = nospec_array_ptr(array, idx, max); >> >> ....can be optimized to one conditional with the barrier. > > Do you mean because the NULL check is redundant? I was hoping that the > compiler would have the necessary visibility to fold that with the > bounds check (on the assumption that the array base must not be NULL). 
...but there are legitimately two conditionals: one to control the non-speculative flow, and one to sink the speculation, a la the array_access() approach from Linus. Keeping those separate seems to lead to less change in the suspected areas. In any event I'll post the reworked patches and we can iterate from there.
On Fri, Jan 5, 2018 at 8:44 AM, Dan Williams <dan.j.williams@intel.com> wrote: > On Fri, Jan 5, 2018 at 6:40 AM, Mark Rutland <mark.rutland@arm.com> wrote: >> On Thu, Jan 04, 2018 at 02:09:52PM -0800, Dan Williams wrote: >>> On Thu, Jan 4, 2018 at 3:47 AM, Mark Rutland <mark.rutland@arm.com> wrote: >>> > Hi Dan, >>> > >>> > Thanks for these examples. >>> > >>> > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote: >>> >> Note, the following is Elena's work, I'm just helping poke the upstream >>> >> discussion along while she's offline. >>> >> >>> >> Elena audited the static analysis reports down to the following >>> >> locations where userspace provides a value for a conditional branch and >>> >> then later creates a dependent load on that same userspace controlled >>> >> value. >>> >> >>> >> 'osb()', observable memory barrier, resolves to an lfence on x86. My >>> >> concern with these changes is that it is not clear what content within >>> >> the branch block is of concern. Peter's 'if_nospec' proposal combined >>> >> with Mark's 'nospec_load' would seem to clear that up, from a self >>> >> documenting code perspective, and let archs optionally protect entire >>> >> conditional blocks or individual loads within those blocks. >>> > >>> > I'm a little concerned by having to use two helpers that need to be paired. It >>> > means that we have to duplicate the bounds information, and I suspect in >>> > practice that they'll often be paired improperly. >>> >>> Hmm, will they be often mispaired? All of the examples to date the >>> load occurs in the same code block as the bound checking if() >>> statement. >> >> Practically speaking, barriers get misused all the time, and having a >> single helper minimizes the risk or misuse. > > I agree, but this is why if_nospec hides the barrier / makes it implicit. > >> >>> > I think that we can avoid those problems if we use nospec_ptr() on its own. It >>> > returns NULL if the pointer would be out-of-bounds, so we can use it in the >>> > if-statement. >>> > >>> > On x86, that can contain the bounds checks followed be an OSB / lfence, like >>> > if_nospec(). On arm we can use our architected sequence. In either case, >>> > nospec_ptr() returns NULL if the pointer would be out-of-bounds. >>> > >>> > Then, rather than sequences like: >>> > >>> > if_nospec(idx < max) { >>> > val = nospec_array_load(array, idx, max); >>> > ... >>> > } >>> > >>> > ... we could have: >>> > >>> > if ((elem_ptr = nospec_array_ptr(array, idx, max)) { >>> > val = *elem_ptr; >>> > ... >>> > } >>> > >>> > ... which also means we don't have to annotate every single load in the branch >>> > if the element is a structure with multiple fields. >>> >>> We wouldn't need to annotate each load in that case, right? Just the >>> instance that uses idx to calculate an address. >> >> Correct. >> >>> if_nospec (idx < max_idx) { >>> elem_ptr = nospec_array_ptr(array, idx, max); >>> val = elem_ptr->val; >>> val2 = elem_ptr->val2; >>> } >>> >>> > Does that sound workable to you? >>> >>> Relative to the urgency of getting fixes upstream I'm ok with whatever >>> approach generates the least debate from sub-system maintainers. >>> >>> One concern is that on x86 the: >>> >>> if ((elem_ptr = nospec_array_ptr(array, idx, max)) { >>> >>> ...approach produces two conditional branches whereas: >>> >>> if_nospec (idx < max_idx) { >>> elem_ptr = nospec_array_ptr(array, idx, max); >>> >>> ....can be optimized to one conditional with the barrier. >> >> Do you mean because the NULL check is redundant? 
I was hoping that the >> compiler would have the necessary visibility to fold that with the >> bounds check (on the assumption that the array base must not be NULL). > > ...but there's legitimately 2 conditionals one to control the > non-speculative flow, and one to sink the speculation ala the > array_access() approach from Linus. Keeping those separate seems to > lead to less change in the suspected areas. In any event I'll post the > reworked patches and we can iterate from there. Disregard this, I'm going ahead with your new nospec_array_ptr() helper as it falls out nicely and seems to be equally self documenting as 'if_nospec'. Since I was already done with the if_nospec + nospec_array_load conversions it's not much more work to go back and roll the nospec_array_ptr conversion on top.
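As an example of what that conversion might look like, the kernel/user_namespace.c m_start() hunk from earlier in the thread could end up roughly as below, assuming nospec_array_ptr(array, idx, max) returns &array[idx] when idx is in bounds and NULL otherwise, as discussed above; the name and final semantics are still being settled in this thread, so treat this as a sketch rather than the eventual patch:

	static void *m_start(struct seq_file *seq, loff_t *ppos,
			     struct uid_gid_map *map)
	{
		loff_t pos = *ppos;
		unsigned extents = map->nr_extents;

		/* paired with smp_wmb in map_write */
		smp_rmb();

		/*
		 * nospec_array_ptr() both performs the pos < extents check and
		 * ensures no speculative load is issued with an out-of-bounds
		 * pos, so the open-coded bounds check and barrier go away.
		 */
		if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
			return nospec_array_ptr(map->extent, pos, extents);

		return nospec_array_ptr(map->forward, pos, extents);
	}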