Message ID: 20210818074352.29950-1-galpress@amazon.com
State:      New
Series:     [RFC] Make use of non-dynamic dmabuf in RDMA
On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@amazon.com> wrote:
>
> Hey all,
>
> Currently, the RDMA subsystem can only work with dynamic dmabuf
> attachments, which requires the RDMA device to support on-demand-paging
> (ODP) which is not common on most devices (only supported by mlx5).
>
> While the dynamic requirement makes sense for certain GPUs, some devices
> (such as habanalabs) have device memory that is always "pinned" and do
> not need/use the move_notify operation.
>
> The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> and EFA as the importer to allow for peer2peer access through libibverbs.
>
> This draft patch changes the dmabuf driver to differentiate between
> static/dynamic attachments by looking at the move_notify op instead of
> the importer_ops struct, and allowing the peer2peer flag to be enabled
> in case of a static exporter.
>
> Thanks
>
> Signed-off-by: Gal Pressman <galpress@amazon.com>

Given that habanalabs dma-buf support is very firmly in limbo (at
least it's not yet in linux-next or anywhere else) I think you want to
solve that problem first before we tackle the additional issue of
making p2p work without dynamic dma-buf. Without that it just doesn't
make a lot of sense really to talk about solutions here.
-Daniel

> ---
>  drivers/dma-buf/dma-buf.c             | 5 +++--
>  drivers/infiniband/core/umem_dmabuf.c | 2 +-
>  include/linux/dma-buf.h               | 2 +-
>  3 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 511fe0d217a0..e3faad8f492c 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -727,7 +727,8 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>  	if (WARN_ON(!dmabuf || !dev))
>  		return ERR_PTR(-EINVAL);
>
> -	if (WARN_ON(importer_ops && !importer_ops->move_notify))
> +	if (WARN_ON(importer_ops && !importer_ops->move_notify &&
> +		    dma_buf_is_dynamic(attach->dmabuf)))
>  		return ERR_PTR(-EINVAL);
>
>  	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
> @@ -1048,7 +1049,7 @@ void dma_buf_move_notify(struct dma_buf *dmabuf)
>  	dma_resv_assert_held(dmabuf->resv);
>
>  	list_for_each_entry(attach, &dmabuf->attachments, node)
> -		if (attach->importer_ops)
> +		if (attach->importer_ops && attach->importer_ops->move_notify)
>  			attach->importer_ops->move_notify(attach);
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_move_notify);
> diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
> index c6e875619fac..c502ae828bd3 100644
> --- a/drivers/infiniband/core/umem_dmabuf.c
> +++ b/drivers/infiniband/core/umem_dmabuf.c
> @@ -118,7 +118,7 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device,
>  	if (check_add_overflow(offset, (unsigned long)size, &end))
>  		return ret;
>
> -	if (unlikely(!ops || !ops->move_notify))
> +	if (unlikely(!ops))
>  		return ret;
>
>  	dmabuf = dma_buf_get(fd);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index efdc56b9d95f..4b2e99012cb1 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -473,7 +473,7 @@ static inline bool dma_buf_is_dynamic(struct dma_buf *dmabuf)
>  static inline bool
>  dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
>  {
> -	return !!attach->importer_ops;
> +	return !!attach->importer_ops->move_notify;
>  }
>
>  struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
> --
> 2.32.0
>
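For reference, a minimal importer-side sketch of what the proposed semantics would permit: importer_ops with allow_peer2peer set but no move_notify hook. The wrapper function and its error handling are illustrative only and not part of the RFC; the dma-buf calls themselves are the existing kernel API.

/*
 * Illustrative sketch, not part of the RFC: a "static" importer that
 * cannot handle buffer moves attaches with importer_ops that set
 * allow_peer2peer but leave move_notify NULL.
 */
#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

static const struct dma_buf_attach_ops static_importer_ops = {
	.allow_peer2peer = true,
	/* no .move_notify: this importer cannot relocate mappings on the fly */
};

static struct sg_table *static_import(struct dma_buf *dmabuf, struct device *dev,
				      struct dma_buf_attachment **out_attach)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	attach = dma_buf_dynamic_attach(dmabuf, dev, &static_importer_ops, NULL);
	if (IS_ERR(attach))
		return ERR_CAST(attach);

	dma_resv_lock(dmabuf->resv, NULL);
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dmabuf->resv);
	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, attach);
		return sgt;
	}

	*out_attach = attach;
	return sgt;
}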
On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote: > On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@amazon.com> wrote: > > > > Hey all, > > > > Currently, the RDMA subsystem can only work with dynamic dmabuf > > attachments, which requires the RDMA device to support on-demand-paging > > (ODP) which is not common on most devices (only supported by mlx5). > > > > While the dynamic requirement makes sense for certain GPUs, some devices > > (such as habanalabs) have device memory that is always "pinned" and do > > not need/use the move_notify operation. > > > > The motivation of this RFC is to use habanalabs as the dmabuf exporter, > > and EFA as the importer to allow for peer2peer access through libibverbs. > > > > This draft patch changes the dmabuf driver to differentiate between > > static/dynamic attachments by looking at the move_notify op instead of > > the importer_ops struct, and allowing the peer2peer flag to be enabled > > in case of a static exporter. > > > > Thanks > > > > Signed-off-by: Gal Pressman <galpress@amazon.com> > > Given that habanalabs dma-buf support is very firmly in limbo (at > least it's not yet in linux-next or anywhere else) I think you want to > solve that problem first before we tackle the additional issue of > making p2p work without dynamic dma-buf. Without that it just doesn't > make a lot of sense really to talk about solutions here. I have been thinking about adding a dmabuf exporter to VFIO, for basically the same reason habana labs wants to do it. In that situation we'd want to see an approach similar to this as well to have a broad usability. The GPU drivers also want this for certain sophisticated scenarios with RDMA, the intree drivers just haven't quite got there yet. So, I think it is worthwhile to start thinking about this regardless of habana labs. Jason
On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote: > > On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@amazon.com> wrote: > > > > > > Hey all, > > > > > > Currently, the RDMA subsystem can only work with dynamic dmabuf > > > attachments, which requires the RDMA device to support on-demand-paging > > > (ODP) which is not common on most devices (only supported by mlx5). > > > > > > While the dynamic requirement makes sense for certain GPUs, some devices > > > (such as habanalabs) have device memory that is always "pinned" and do > > > not need/use the move_notify operation. > > > > > > The motivation of this RFC is to use habanalabs as the dmabuf exporter, > > > and EFA as the importer to allow for peer2peer access through libibverbs. > > > > > > This draft patch changes the dmabuf driver to differentiate between > > > static/dynamic attachments by looking at the move_notify op instead of > > > the importer_ops struct, and allowing the peer2peer flag to be enabled > > > in case of a static exporter. > > > > > > Thanks > > > > > > Signed-off-by: Gal Pressman <galpress@amazon.com> > > > > Given that habanalabs dma-buf support is very firmly in limbo (at > > least it's not yet in linux-next or anywhere else) I think you want to > > solve that problem first before we tackle the additional issue of > > making p2p work without dynamic dma-buf. Without that it just doesn't > > make a lot of sense really to talk about solutions here. > > I have been thinking about adding a dmabuf exporter to VFIO, for > basically the same reason habana labs wants to do it. > > In that situation we'd want to see an approach similar to this as well > to have a broad usability. > > The GPU drivers also want this for certain sophisticated scenarios > with RDMA, the intree drivers just haven't quite got there yet. > > So, I think it is worthwhile to start thinking about this regardless > of habana labs. Oh sure, I've been having these for a while. I think there's two options: - some kind of soft-pin, where the contract is that we only revoke when absolutely necessary, and it's expected to be catastrophic on the importer's side. The use-case would be single user that fully controls all accelerator local memory, and so kernel driver evicting stuff. I havent implemented it, but the idea is that essentially in eviction we check whom we're evicting for (by checking owners of buffers maybe, atm those are not tracked in generic code but not that hard to add), and if it's the same userspace owner we don't ever pick these buffers as victims for eviction, preferreing -ENOMEM/-ENOSPC. If a new user comes around then we'd still throw these out to avoid abuse, and it would be up to sysadmins to make sure this doesn't happen untimely, maybe with the next thing. - cgroups for (pinned) buffers. Mostly because cgroups for local memory is somewhere on the plans anyway, but that one could also take forever since there's questions about overlap with memcg and things like that, plus thus far everyone who cares made and incompatible proposal about how it should be done :-/ A variant of the first one would be device-level revoke, which is a concept we already have in drm for the modesetting side and also for like 20 year old gpu drivers. 
We could brush that off and close some of the gaps (a student is fixing the locking right now, the thing left to do is mmap revoke), and I think that model of exclusive device ownership with the option to revoke fits pretty well for at least some of the accelerators floating around. In that case importers would never get a move_notify (maybe we should call this revoke_notify to make it clear it's a bit different) callback, except when the entire thing has been yanked. I think that would fit pretty well for VFIO, and I think we should be able to make it work for rdma too as some kind of auto-deregister. The locking might be fun with both of these since I expect some inversions compared to the register path, we'll have to figure these out. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
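To make the revoke idea concrete, here is a purely hypothetical sketch of what such an importer callback could look like. Nothing like revoke_notify exists in the dma-buf API today, and struct my_importer and my_hw_quiesce() are invented names used only to show the shape of the "revoke, never move" model Daniel describes.

/*
 * Hypothetical only -- nothing like this is in the tree. On revoke the
 * importer fences its hardware and drops the mapping for good; unlike
 * move_notify there is no expectation of ever re-mapping the buffer.
 */
struct my_importer {
	struct sg_table *sgt;
	bool revoked;
};

static void my_importer_revoke_notify(struct dma_buf_attachment *attach)
{
	struct my_importer *imp = attach->importer_priv;

	dma_resv_assert_held(attach->dmabuf->resv);

	my_hw_quiesce(imp);	/* invented helper: stop all DMA through this mapping */

	if (imp->sgt) {
		dma_buf_unmap_attachment(attach, imp->sgt, DMA_BIDIRECTIONAL);
		imp->sgt = NULL;
	}
	imp->revoked = true;	/* later map attempts fail permanently */
}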
On Fri, Aug 20, 2021 at 09:25:30AM +0200, Daniel Vetter wrote: > On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote: > > > On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@amazon.com> wrote: > > > > > > > > Hey all, > > > > > > > > Currently, the RDMA subsystem can only work with dynamic dmabuf > > > > attachments, which requires the RDMA device to support on-demand-paging > > > > (ODP) which is not common on most devices (only supported by mlx5). > > > > > > > > While the dynamic requirement makes sense for certain GPUs, some devices > > > > (such as habanalabs) have device memory that is always "pinned" and do > > > > not need/use the move_notify operation. > > > > > > > > The motivation of this RFC is to use habanalabs as the dmabuf exporter, > > > > and EFA as the importer to allow for peer2peer access through libibverbs. > > > > > > > > This draft patch changes the dmabuf driver to differentiate between > > > > static/dynamic attachments by looking at the move_notify op instead of > > > > the importer_ops struct, and allowing the peer2peer flag to be enabled > > > > in case of a static exporter. > > > > > > > > Thanks > > > > > > > > Signed-off-by: Gal Pressman <galpress@amazon.com> > > > > > > Given that habanalabs dma-buf support is very firmly in limbo (at > > > least it's not yet in linux-next or anywhere else) I think you want to > > > solve that problem first before we tackle the additional issue of > > > making p2p work without dynamic dma-buf. Without that it just doesn't > > > make a lot of sense really to talk about solutions here. > > > > I have been thinking about adding a dmabuf exporter to VFIO, for > > basically the same reason habana labs wants to do it. > > > > In that situation we'd want to see an approach similar to this as well > > to have a broad usability. > > > > The GPU drivers also want this for certain sophisticated scenarios > > with RDMA, the intree drivers just haven't quite got there yet. > > > > So, I think it is worthwhile to start thinking about this regardless > > of habana labs. > > Oh sure, I've been having these for a while. I think there's two options: > - some kind of soft-pin, where the contract is that we only revoke > when absolutely necessary, and it's expected to be catastrophic on the > importer's side. Honestly, I'm not very keen on this. We don't really have HW support in several RDMA scenarios for even catastrophic unpin. Gal, can EFA even do this for a MR? You basically have to resize the rkey/lkey to zero length (or invalidate it like a FMR) under the catstrophic revoke. The rkey/lkey cannot just be destroyed as that opens a security problem with rkey/lkey re-use. I think I saw EFA's current out of tree implementations had this bug. > to do is mmap revoke), and I think that model of exclusive device > ownership with the option to revoke fits pretty well for at least some > of the accelerators floating around. In that case importers would > never get a move_notify (maybe we should call this revoke_notify to > make it clear it's a bit different) callback, except when the entire > thing has been yanked. I think that would fit pretty well for VFIO, > and I think we should be able to make it work for rdma too as some > kind of auto-deregister. The locking might be fun with both of these > since I expect some inversions compared to the register path, we'll > have to figure these out. 
It fits semantically nicely, VFIO also has a revoke semantic for BAR mappings. The challenge is the RDMA side which doesn't have a 'dma disabled error state' for objects as part of the spec. Some HW, like mlx5, can implement this for MR objects (see revoke_mr), but I don't know if anything else can, and even mlx5 currently can't do a revoke for any other object type. I don't know how useful it would be, need to check on some of the use cases. The locking is tricky as we have to issue a device command, but that device command cannot run concurrently with destruction or the tail part of creation. Jason
On 20/08/2021 15:33, Jason Gunthorpe wrote: > On Fri, Aug 20, 2021 at 09:25:30AM +0200, Daniel Vetter wrote: >> On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: >>> On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote: >>>> On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@amazon.com> wrote: >>>>> >>>>> Hey all, >>>>> >>>>> Currently, the RDMA subsystem can only work with dynamic dmabuf >>>>> attachments, which requires the RDMA device to support on-demand-paging >>>>> (ODP) which is not common on most devices (only supported by mlx5). >>>>> >>>>> While the dynamic requirement makes sense for certain GPUs, some devices >>>>> (such as habanalabs) have device memory that is always "pinned" and do >>>>> not need/use the move_notify operation. >>>>> >>>>> The motivation of this RFC is to use habanalabs as the dmabuf exporter, >>>>> and EFA as the importer to allow for peer2peer access through libibverbs. >>>>> >>>>> This draft patch changes the dmabuf driver to differentiate between >>>>> static/dynamic attachments by looking at the move_notify op instead of >>>>> the importer_ops struct, and allowing the peer2peer flag to be enabled >>>>> in case of a static exporter. >>>>> >>>>> Thanks >>>>> >>>>> Signed-off-by: Gal Pressman <galpress@amazon.com> >>>> >>>> Given that habanalabs dma-buf support is very firmly in limbo (at >>>> least it's not yet in linux-next or anywhere else) I think you want to >>>> solve that problem first before we tackle the additional issue of >>>> making p2p work without dynamic dma-buf. Without that it just doesn't >>>> make a lot of sense really to talk about solutions here. >>> >>> I have been thinking about adding a dmabuf exporter to VFIO, for >>> basically the same reason habana labs wants to do it. >>> >>> In that situation we'd want to see an approach similar to this as well >>> to have a broad usability. >>> >>> The GPU drivers also want this for certain sophisticated scenarios >>> with RDMA, the intree drivers just haven't quite got there yet. >>> >>> So, I think it is worthwhile to start thinking about this regardless >>> of habana labs. >> >> Oh sure, I've been having these for a while. I think there's two options: >> - some kind of soft-pin, where the contract is that we only revoke >> when absolutely necessary, and it's expected to be catastrophic on the >> importer's side. > > Honestly, I'm not very keen on this. We don't really have HW support > in several RDMA scenarios for even catastrophic unpin. > > Gal, can EFA even do this for a MR? You basically have to resize the > rkey/lkey to zero length (or invalidate it like a FMR) under the > catstrophic revoke. The rkey/lkey cannot just be destroyed as that > opens a security problem with rkey/lkey re-use. I had some discussions with our hardware guys about such functionality in the past, I think it should be doable (not necessarily the same way that FMR does it). Though it would've been nicer if we could agree on a solution that could work for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has. That's why I tried to approach this by denying such attachments for non-ODP importers instead of exposing a "limited" dynamic importer.
On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote: > Though it would've been nicer if we could agree on a solution that could work > for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has. I don't think it can really be done, revoke is necessary, and isn't a primitive we have today. Revoke is sort of like rereg MR, but with a guaranteed no-change to the lkey/rkey Then there is the locking complexity of linking the mr creation and destruction to the lifecycle of the pages, which is messy and maybe not general. For instance mlx5 would call its revoke_mr, disconnect the dmabuf then destroy the mkey - but this is only safe because mlx5 HW can handle concurrent revokes. > That's why I tried to approach this by denying such attachments for non-ODP > importers instead of exposing a "limited" dynamic importer. That is fine if there is no revoke - once revoke exists we must have driver and HW support. Jason
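As a rough sketch of the destruction ordering described here: revoke_hw_key(), destroy_hw_key() and struct my_mr are placeholders, not real mlx5 (or EFA) entry points; only the ordering is the point.

#include <linux/dma-resv.h>
#include <rdma/ib_umem.h>

struct my_mr {			/* placeholder driver MR wrapper */
	struct ib_umem *umem;	/* dmabuf-backed umem */
	u32 key;
};

static void revoke_hw_key(struct my_mr *mr);	/* placeholder HW command */
static void destroy_hw_key(struct my_mr *mr);	/* placeholder HW command */

static void my_destroy_dmabuf_mr(struct my_mr *mr)
{
	struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(mr->umem);

	/* 1. HW stops DMA through the key, but the key value stays reserved */
	revoke_hw_key(mr);

	/* 2. Only now is it safe to drop the dma-buf mapping and attachment */
	dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
	ib_umem_dmabuf_unmap_pages(umem_dmabuf);
	dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
	ib_umem_release(mr->umem);

	/* 3. Finally destroy the HW object itself */
	destroy_hw_key(mr);
}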
On 20/08/2021 17:32, Jason Gunthorpe wrote: > On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote: > >> Though it would've been nicer if we could agree on a solution that could work >> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has. > > I don't think it can really be done, revoke is necessary, and isn't a > primitive we have today. > > Revoke is sort of like rereg MR, but with a guaranteed no-change to > the lkey/rkey > > Then there is the locking complexity of linking the mr creation and > destruction to the lifecycle of the pages, which is messy and maybe > not general. For instance mlx5 would call its revoke_mr, disconnect > the dmabuf then destroy the mkey - but this is only safe because mlx5 > HW can handle concurrent revokes. Thanks, that makes sense. >> That's why I tried to approach this by denying such attachments for non-ODP >> importers instead of exposing a "limited" dynamic importer. > > That is fine if there is no revoke - once revoke exists we must have > driver and HW support. Agree. IIUC, we're talking about three different exporter "types": - Dynamic with move_notify (requires ODP) - Dynamic with revoke_notify - Static Which changes do we need to make the third one work?
Am 21.08.21 um 11:16 schrieb Gal Pressman: > On 20/08/2021 17:32, Jason Gunthorpe wrote: >> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote: >> >>> Though it would've been nicer if we could agree on a solution that could work >>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has. >> I don't think it can really be done, revoke is necessary, and isn't a >> primitive we have today. >> >> Revoke is sort of like rereg MR, but with a guaranteed no-change to >> the lkey/rkey >> >> Then there is the locking complexity of linking the mr creation and >> destruction to the lifecycle of the pages, which is messy and maybe >> not general. For instance mlx5 would call its revoke_mr, disconnect >> the dmabuf then destroy the mkey - but this is only safe because mlx5 >> HW can handle concurrent revokes. > Thanks, that makes sense. > >>> That's why I tried to approach this by denying such attachments for non-ODP >>> importers instead of exposing a "limited" dynamic importer. >> That is fine if there is no revoke - once revoke exists we must have >> driver and HW support. > Agree. > IIUC, we're talking about three different exporter "types": > - Dynamic with move_notify (requires ODP) > - Dynamic with revoke_notify > - Static > > Which changes do we need to make the third one work? Basically none at all in the framework. You just need to properly use the dma_buf_pin() function when you start using a buffer (e.g. before you create an attachment) and the dma_buf_unpin() function after you are done with the DMA-buf. Regards, Christian.
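A minimal sketch of that pinned flow from the importer side, with error handling trimmed. Depending on how the attachment was created the dma-buf core may already pin internally, and whether dma_buf_pin() migrates the buffer is exporter policy, which is exactly what the rest of the thread debates.

/*
 * Sketch of the pin-based ("static") use: pin the attachment up front,
 * map once, and keep it pinned until the importer is completely done.
 */
static struct sg_table *map_pinned(struct dma_buf_attachment *attach)
{
	struct dma_buf *dmabuf = attach->dmabuf;
	struct sg_table *sgt;
	int ret;

	dma_resv_lock(dmabuf->resv, NULL);
	ret = dma_buf_pin(attach);	/* exporter may migrate or refuse here */
	if (ret) {
		sgt = ERR_PTR(ret);
		goto unlock;
	}
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt))
		dma_buf_unpin(attach);
unlock:
	dma_resv_unlock(dmabuf->resv);
	return sgt;
}

/* Teardown mirrors this: dma_buf_unmap_attachment(), then dma_buf_unpin(), under the same lock. */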
On 23/08/2021 13:43, Christian König wrote: > Am 21.08.21 um 11:16 schrieb Gal Pressman: >> On 20/08/2021 17:32, Jason Gunthorpe wrote: >>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote: >>> >>>> Though it would've been nicer if we could agree on a solution that could work >>>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem >>>> has. >>> I don't think it can really be done, revoke is necessary, and isn't a >>> primitive we have today. >>> >>> Revoke is sort of like rereg MR, but with a guaranteed no-change to >>> the lkey/rkey >>> >>> Then there is the locking complexity of linking the mr creation and >>> destruction to the lifecycle of the pages, which is messy and maybe >>> not general. For instance mlx5 would call its revoke_mr, disconnect >>> the dmabuf then destroy the mkey - but this is only safe because mlx5 >>> HW can handle concurrent revokes. >> Thanks, that makes sense. >> >>>> That's why I tried to approach this by denying such attachments for non-ODP >>>> importers instead of exposing a "limited" dynamic importer. >>> That is fine if there is no revoke - once revoke exists we must have >>> driver and HW support. >> Agree. >> IIUC, we're talking about three different exporter "types": >> - Dynamic with move_notify (requires ODP) >> - Dynamic with revoke_notify >> - Static >> >> Which changes do we need to make the third one work? > > Basically none at all in the framework. > > You just need to properly use the dma_buf_pin() function when you start using a > buffer (e.g. before you create an attachment) and the dma_buf_unpin() function > after you are done with the DMA-buf. I replied to your previous mail, but I'll ask again. Doesn't the pin operation migrate the memory to host memory?
Am 24.08.21 um 11:06 schrieb Gal Pressman: > On 23/08/2021 13:43, Christian König wrote: >> Am 21.08.21 um 11:16 schrieb Gal Pressman: >>> On 20/08/2021 17:32, Jason Gunthorpe wrote: >>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote: >>>> >>>>> Though it would've been nicer if we could agree on a solution that could work >>>>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem >>>>> has. >>>> I don't think it can really be done, revoke is necessary, and isn't a >>>> primitive we have today. >>>> >>>> Revoke is sort of like rereg MR, but with a guaranteed no-change to >>>> the lkey/rkey >>>> >>>> Then there is the locking complexity of linking the mr creation and >>>> destruction to the lifecycle of the pages, which is messy and maybe >>>> not general. For instance mlx5 would call its revoke_mr, disconnect >>>> the dmabuf then destroy the mkey - but this is only safe because mlx5 >>>> HW can handle concurrent revokes. >>> Thanks, that makes sense. >>> >>>>> That's why I tried to approach this by denying such attachments for non-ODP >>>>> importers instead of exposing a "limited" dynamic importer. >>>> That is fine if there is no revoke - once revoke exists we must have >>>> driver and HW support. >>> Agree. >>> IIUC, we're talking about three different exporter "types": >>> - Dynamic with move_notify (requires ODP) >>> - Dynamic with revoke_notify >>> - Static >>> >>> Which changes do we need to make the third one work? >> Basically none at all in the framework. >> >> You just need to properly use the dma_buf_pin() function when you start using a >> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function >> after you are done with the DMA-buf. > I replied to your previous mail, but I'll ask again. > Doesn't the pin operation migrate the memory to host memory? Sorry missed your previous reply. And yes at least for the amdgpu driver we migrate the memory to host memory as soon as it is pinned and I would expect that other GPU drivers do something similar. This is intentional since we don't want any P2P to video memory with pinned objects and want to avoid to run into a situation where one device is doing P2P to video memory while another device needs the DMA-buf in host memory. You can still do P2P with pinned object, it's just up to the exporting driver if it is allowed or not. The other option is what Daniel suggested that we have some kind of revoke. This is essentially what our KFD is doing as well when doing interop with 3D GFX, but from Jasons responses I have a bit of doubt that this will actually work on the hardware level for RDMA. Regards, Christian.
On 8/24/21 2:32 AM, Christian König wrote: > Am 24.08.21 um 11:06 schrieb Gal Pressman: >> On 23/08/2021 13:43, Christian König wrote: >>> Am 21.08.21 um 11:16 schrieb Gal Pressman: >>>> On 20/08/2021 17:32, Jason Gunthorpe wrote: >>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote: ... >>>> IIUC, we're talking about three different exporter "types": >>>> - Dynamic with move_notify (requires ODP) >>>> - Dynamic with revoke_notify >>>> - Static >>>> >>>> Which changes do we need to make the third one work? >>> Basically none at all in the framework. >>> >>> You just need to properly use the dma_buf_pin() function when you start using a >>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function >>> after you are done with the DMA-buf. >> I replied to your previous mail, but I'll ask again. >> Doesn't the pin operation migrate the memory to host memory? > > Sorry missed your previous reply. > > And yes at least for the amdgpu driver we migrate the memory to host memory as soon as it is pinned > and I would expect that other GPU drivers do something similar. Well...for many topologies, migrating to host memory will result in a dramatically slower p2p setup. For that reason, some GPU drivers may want to allow pinning of video memory in some situations. Ideally, you've got modern ODP devices and you don't even need to pin. But if not, and you still hope to do high performance p2p between a GPU and a non-ODP Infiniband device, then you would need to leave the pinned memory in vidmem. So I think we don't want to rule out that behavior, right? Or is the thinking more like, "you're lucky that this old non-ODP setup works at all, and we'll make it work by routing through host/cpu memory, but it will be slow"? thanks,
On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote: > On 8/24/21 2:32 AM, Christian König wrote: > > Am 24.08.21 um 11:06 schrieb Gal Pressman: > > > On 23/08/2021 13:43, Christian König wrote: > > > > Am 21.08.21 um 11:16 schrieb Gal Pressman: > > > > > On 20/08/2021 17:32, Jason Gunthorpe wrote: > > > > > > On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote: > ... > > > > > IIUC, we're talking about three different exporter "types": > > > > > - Dynamic with move_notify (requires ODP) > > > > > - Dynamic with revoke_notify > > > > > - Static > > > > > > > > > > Which changes do we need to make the third one work? > > > > Basically none at all in the framework. > > > > > > > > You just need to properly use the dma_buf_pin() function when you start using a > > > > buffer (e.g. before you create an attachment) and the dma_buf_unpin() function > > > > after you are done with the DMA-buf. > > > I replied to your previous mail, but I'll ask again. > > > Doesn't the pin operation migrate the memory to host memory? > > > > Sorry missed your previous reply. > > > > And yes at least for the amdgpu driver we migrate the memory to host > > memory as soon as it is pinned and I would expect that other GPU drivers > > do something similar. > > Well...for many topologies, migrating to host memory will result in a > dramatically slower p2p setup. For that reason, some GPU drivers may > want to allow pinning of video memory in some situations. > > Ideally, you've got modern ODP devices and you don't even need to pin. > But if not, and you still hope to do high performance p2p between a GPU > and a non-ODP Infiniband device, then you would need to leave the pinned > memory in vidmem. > > So I think we don't want to rule out that behavior, right? Or is the > thinking more like, "you're lucky that this old non-ODP setup works at > all, and we'll make it work by routing through host/cpu memory, but it > will be slow"? I think it depends on the user, if the user creates memory which is permanently located on the GPU then it should be pinnable in this way without force migration. But if the memory is inherently migratable then it just cannot be pinned in the GPU at all as we can't indefinately block migration from happening eg if the CPU touches it later or something. Jason
On 8/24/21 10:32 AM, Jason Gunthorpe wrote: ... >>> And yes at least for the amdgpu driver we migrate the memory to host >>> memory as soon as it is pinned and I would expect that other GPU drivers >>> do something similar. >> >> Well...for many topologies, migrating to host memory will result in a >> dramatically slower p2p setup. For that reason, some GPU drivers may >> want to allow pinning of video memory in some situations. >> >> Ideally, you've got modern ODP devices and you don't even need to pin. >> But if not, and you still hope to do high performance p2p between a GPU >> and a non-ODP Infiniband device, then you would need to leave the pinned >> memory in vidmem. >> >> So I think we don't want to rule out that behavior, right? Or is the >> thinking more like, "you're lucky that this old non-ODP setup works at >> all, and we'll make it work by routing through host/cpu memory, but it >> will be slow"? > > I think it depends on the user, if the user creates memory which is > permanently located on the GPU then it should be pinnable in this way > without force migration. But if the memory is inherently migratable > then it just cannot be pinned in the GPU at all as we can't > indefinately block migration from happening eg if the CPU touches it > later or something. > OK. I just want to avoid creating any API-level assumptions that dma_buf_pin() necessarily implies or requires migrating to host memory. thanks,
On Wed, 25 Aug 2021 at 03:36, John Hubbard <jhubbard@nvidia.com> wrote: > > On 8/24/21 10:32 AM, Jason Gunthorpe wrote: > ... > >>> And yes at least for the amdgpu driver we migrate the memory to host > >>> memory as soon as it is pinned and I would expect that other GPU drivers > >>> do something similar. > >> > >> Well...for many topologies, migrating to host memory will result in a > >> dramatically slower p2p setup. For that reason, some GPU drivers may > >> want to allow pinning of video memory in some situations. > >> > >> Ideally, you've got modern ODP devices and you don't even need to pin. > >> But if not, and you still hope to do high performance p2p between a GPU > >> and a non-ODP Infiniband device, then you would need to leave the pinned > >> memory in vidmem. > >> > >> So I think we don't want to rule out that behavior, right? Or is the > >> thinking more like, "you're lucky that this old non-ODP setup works at > >> all, and we'll make it work by routing through host/cpu memory, but it > >> will be slow"? > > > > I think it depends on the user, if the user creates memory which is > > permanently located on the GPU then it should be pinnable in this way > > without force migration. But if the memory is inherently migratable > > then it just cannot be pinned in the GPU at all as we can't > > indefinately block migration from happening eg if the CPU touches it > > later or something. > > > > OK. I just want to avoid creating any API-level assumptions that dma_buf_pin() > necessarily implies or requires migrating to host memory. I'm not sure we should be allowing dma_buf_pin at all on non-migratable memory, what's to stop someone just pinning all the VRAM and making the GPU unuseable? I understand not considering more than a single user in these situations is enterprise thinking, but I do worry about pinning is always fine type of thinking when things are shared or multi-user. My impression from this is we've designed hardware that didn't consider the problem, and now to let us use that hardware in horrible ways we should just allow it to pin all the things. Dave.
On Wed, Aug 25, 2021 at 05:15:52AM +1000, Dave Airlie wrote: > On Wed, 25 Aug 2021 at 03:36, John Hubbard <jhubbard@nvidia.com> wrote: > > > > On 8/24/21 10:32 AM, Jason Gunthorpe wrote: > > ... > > >>> And yes at least for the amdgpu driver we migrate the memory to host > > >>> memory as soon as it is pinned and I would expect that other GPU drivers > > >>> do something similar. > > >> > > >> Well...for many topologies, migrating to host memory will result in a > > >> dramatically slower p2p setup. For that reason, some GPU drivers may > > >> want to allow pinning of video memory in some situations. > > >> > > >> Ideally, you've got modern ODP devices and you don't even need to pin. > > >> But if not, and you still hope to do high performance p2p between a GPU > > >> and a non-ODP Infiniband device, then you would need to leave the pinned > > >> memory in vidmem. > > >> > > >> So I think we don't want to rule out that behavior, right? Or is the > > >> thinking more like, "you're lucky that this old non-ODP setup works at > > >> all, and we'll make it work by routing through host/cpu memory, but it > > >> will be slow"? > > > > > > I think it depends on the user, if the user creates memory which is > > > permanently located on the GPU then it should be pinnable in this way > > > without force migration. But if the memory is inherently migratable > > > then it just cannot be pinned in the GPU at all as we can't > > > indefinately block migration from happening eg if the CPU touches it > > > later or something. > > > > > > > OK. I just want to avoid creating any API-level assumptions that dma_buf_pin() > > necessarily implies or requires migrating to host memory. > > I'm not sure we should be allowing dma_buf_pin at all on > non-migratable memory, what's to stop someone just pinning all the > VRAM and making the GPU unuseable? IMHO the same thinking that prevents pining all of system ram and making the system unusable? GPU isn't so special here. The main restriction is the pinned memory ulimit. For most out-of-the-box cases this is set to something like 64k For the single-user HPC use cases it is made unlimited. > My impression from this is we've designed hardware that didn't > consider the problem, and now to let us use that hardware in horrible > ways we should just allow it to pin all the things. It is more complex than that, HW that can support dynamic memory under *everything* is complicated (and in some cases slow!). As there is only a weak rational to do this, we don't see it in often in the market. Jason
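The "pinned memory ulimit" here is RLIMIT_MEMLOCK, which RDMA memory registration is accounted against; a small userspace check of the limits Jason mentions (64 KiB is a common distro default for the soft limit):

#include <stdio.h>
#include <sys/resource.h>

/* Print the RLIMIT_MEMLOCK limits that bound how much memory an
 * unprivileged process may pin; HPC/RDMA deployments typically raise
 * them to unlimited. */
int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		perror("getrlimit");
		return 1;
	}
	printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
	       (unsigned long long)rl.rlim_cur,
	       (unsigned long long)rl.rlim_max);
	return 0;
}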
On Tue, Aug 24, 2021 at 3:16 PM Dave Airlie <airlied@gmail.com> wrote: > > On Wed, 25 Aug 2021 at 03:36, John Hubbard <jhubbard@nvidia.com> wrote: > > > > On 8/24/21 10:32 AM, Jason Gunthorpe wrote: > > ... > > >>> And yes at least for the amdgpu driver we migrate the memory to host > > >>> memory as soon as it is pinned and I would expect that other GPU drivers > > >>> do something similar. > > >> > > >> Well...for many topologies, migrating to host memory will result in a > > >> dramatically slower p2p setup. For that reason, some GPU drivers may > > >> want to allow pinning of video memory in some situations. > > >> > > >> Ideally, you've got modern ODP devices and you don't even need to pin. > > >> But if not, and you still hope to do high performance p2p between a GPU > > >> and a non-ODP Infiniband device, then you would need to leave the pinned > > >> memory in vidmem. > > >> > > >> So I think we don't want to rule out that behavior, right? Or is the > > >> thinking more like, "you're lucky that this old non-ODP setup works at > > >> all, and we'll make it work by routing through host/cpu memory, but it > > >> will be slow"? > > > > > > I think it depends on the user, if the user creates memory which is > > > permanently located on the GPU then it should be pinnable in this way > > > without force migration. But if the memory is inherently migratable > > > then it just cannot be pinned in the GPU at all as we can't > > > indefinately block migration from happening eg if the CPU touches it > > > later or something. > > > > > > > OK. I just want to avoid creating any API-level assumptions that dma_buf_pin() > > necessarily implies or requires migrating to host memory. > > I'm not sure we should be allowing dma_buf_pin at all on > non-migratable memory, what's to stop someone just pinning all the > VRAM and making the GPU unuseable? In a lot of cases we have GPUs with more VRAM than system memory, but we allow pinning in system memory. Alex > > I understand not considering more than a single user in these > situations is enterprise thinking, but I do worry about pinning is > always fine type of thinking when things are shared or multi-user. > > My impression from this is we've designed hardware that didn't > consider the problem, and now to let us use that hardware in horrible > ways we should just allow it to pin all the things. > > Dave.
> -----Original Message----- > From: Alex Deucher <alexdeucher@gmail.com> > Sent: Tuesday, August 24, 2021 12:44 PM > To: Dave Airlie <airlied@gmail.com> > Cc: John Hubbard <jhubbard@nvidia.com>; Jason Gunthorpe <jgg@ziepe.ca>; Christian König <christian.koenig@amd.com>; Gal Pressman > <galpress@amazon.com>; Daniel Vetter <daniel@ffwll.ch>; Sumit Semwal <sumit.semwal@linaro.org>; Doug Ledford > <dledford@redhat.com>; open list:DMA BUFFER SHARING FRAMEWORK <linux-media@vger.kernel.org>; dri-devel <dri- > devel@lists.freedesktop.org>; Linux Kernel Mailing List <linux-kernel@vger.kernel.org>; linux-rdma <linux-rdma@vger.kernel.org>; Gabbay, > Oded (Habana) <ogabbay@habana.ai>; Tayar, Tomer (Habana) <ttayar@habana.ai>; Yossi Leybovich <sleybo@amazon.com>; Alexander > Matushevsky <matua@amazon.com>; Leon Romanovsky <leonro@nvidia.com>; Xiong, Jianxin <jianxin.xiong@intel.com> > Subject: Re: [RFC] Make use of non-dynamic dmabuf in RDMA > > On Tue, Aug 24, 2021 at 3:16 PM Dave Airlie <airlied@gmail.com> wrote: > > > > On Wed, 25 Aug 2021 at 03:36, John Hubbard <jhubbard@nvidia.com> wrote: > > > > > > On 8/24/21 10:32 AM, Jason Gunthorpe wrote: > > > ... > > > >>> And yes at least for the amdgpu driver we migrate the memory to > > > >>> host memory as soon as it is pinned and I would expect that > > > >>> other GPU drivers do something similar. > > > >> > > > >> Well...for many topologies, migrating to host memory will result > > > >> in a dramatically slower p2p setup. For that reason, some GPU > > > >> drivers may want to allow pinning of video memory in some situations. > > > >> > > > >> Ideally, you've got modern ODP devices and you don't even need to pin. > > > >> But if not, and you still hope to do high performance p2p between > > > >> a GPU and a non-ODP Infiniband device, then you would need to > > > >> leave the pinned memory in vidmem. > > > >> > > > >> So I think we don't want to rule out that behavior, right? Or is > > > >> the thinking more like, "you're lucky that this old non-ODP setup > > > >> works at all, and we'll make it work by routing through host/cpu > > > >> memory, but it will be slow"? > > > > > > > > I think it depends on the user, if the user creates memory which > > > > is permanently located on the GPU then it should be pinnable in > > > > this way without force migration. But if the memory is inherently > > > > migratable then it just cannot be pinned in the GPU at all as we > > > > can't indefinately block migration from happening eg if the CPU > > > > touches it later or something. > > > > > > > > > > OK. I just want to avoid creating any API-level assumptions that > > > dma_buf_pin() necessarily implies or requires migrating to host memory. > > > > I'm not sure we should be allowing dma_buf_pin at all on > > non-migratable memory, what's to stop someone just pinning all the > > VRAM and making the GPU unuseable? > > In a lot of cases we have GPUs with more VRAM than system memory, but we allow pinning in system memory. > > Alex > In addition, the dma-buf exporter can be a non-GPU device. Jianxin > > > > I understand not considering more than a single user in these > > situations is enterprise thinking, but I do worry about pinning is > > always fine type of thinking when things are shared or multi-user. > > > > My impression from this is we've designed hardware that didn't > > consider the problem, and now to let us use that hardware in horrible > > ways we should just allow it to pin all the things. > > > > Dave.
Am 24.08.21 um 19:32 schrieb Jason Gunthorpe: > On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote: >> On 8/24/21 2:32 AM, Christian König wrote: >>> Am 24.08.21 um 11:06 schrieb Gal Pressman: >>>> On 23/08/2021 13:43, Christian König wrote: >>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman: >>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote: >>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote: >> ... >>>>>> IIUC, we're talking about three different exporter "types": >>>>>> - Dynamic with move_notify (requires ODP) >>>>>> - Dynamic with revoke_notify >>>>>> - Static >>>>>> >>>>>> Which changes do we need to make the third one work? >>>>> Basically none at all in the framework. >>>>> >>>>> You just need to properly use the dma_buf_pin() function when you start using a >>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function >>>>> after you are done with the DMA-buf. >>>> I replied to your previous mail, but I'll ask again. >>>> Doesn't the pin operation migrate the memory to host memory? >>> Sorry missed your previous reply. >>> >>> And yes at least for the amdgpu driver we migrate the memory to host >>> memory as soon as it is pinned and I would expect that other GPU drivers >>> do something similar. >> Well...for many topologies, migrating to host memory will result in a >> dramatically slower p2p setup. For that reason, some GPU drivers may >> want to allow pinning of video memory in some situations. >> >> Ideally, you've got modern ODP devices and you don't even need to pin. >> But if not, and you still hope to do high performance p2p between a GPU >> and a non-ODP Infiniband device, then you would need to leave the pinned >> memory in vidmem. >> >> So I think we don't want to rule out that behavior, right? Or is the >> thinking more like, "you're lucky that this old non-ODP setup works at >> all, and we'll make it work by routing through host/cpu memory, but it >> will be slow"? > I think it depends on the user, if the user creates memory which is > permanently located on the GPU then it should be pinnable in this way > without force migration. But if the memory is inherently migratable > then it just cannot be pinned in the GPU at all as we can't > indefinately block migration from happening eg if the CPU touches it > later or something. Yes, exactly that's the point. Especially GPUs have a great variety of setups. For example we have APUs where the local memory is just stolen system memory and all buffers must be migrate-able because you might need all of this stolen memory for scanout or page tables. In this case P2P only makes sense to avoid the migration overhead in the first place. Then you got dGPUs where only a fraction of the VRAM is accessible from the PCIe BUS. Here you also absolutely don't want to pin any buffers because that can easily crash when we need to migrate something into the visible window for CPU access. The only real option where you could do P2P with buffer pinning are those compute boards where we know that everything is always accessible to everybody and we will never need to migrate anything. But even then you want some mechanism like cgroups to take care of limiting this. Otherwise any runaway process can bring down your whole system. Key question at least for me as GPU maintainer is if we are going to see modern compute boards together with old non-ODP setups. Since those compute boards are usually used with new hardware (like PCIe v4 for example) the answer I think is most likely "no". Christian. 
> > Jason
On 8/24/21 11:17 PM, Christian König wrote: ... >> I think it depends on the user, if the user creates memory which is >> permanently located on the GPU then it should be pinnable in this way >> without force migration. But if the memory is inherently migratable >> then it just cannot be pinned in the GPU at all as we can't >> indefinately block migration from happening eg if the CPU touches it >> later or something. > > Yes, exactly that's the point. Especially GPUs have a great variety of setups. > > For example we have APUs where the local memory is just stolen system memory and all buffers must be > migrate-able because you might need all of this stolen memory for scanout or page tables. In this > case P2P only makes sense to avoid the migration overhead in the first place. > > Then you got dGPUs where only a fraction of the VRAM is accessible from the PCIe BUS. Here you also > absolutely don't want to pin any buffers because that can easily crash when we need to migrate > something into the visible window for CPU access. > > The only real option where you could do P2P with buffer pinning are those compute boards where we > know that everything is always accessible to everybody and we will never need to migrate anything. > But even then you want some mechanism like cgroups to take care of limiting this. Otherwise any > runaway process can bring down your whole system. > > Key question at least for me as GPU maintainer is if we are going to see modern compute boards > together with old non-ODP setups. Since those compute boards are usually used with new hardware > (like PCIe v4 for example) the answer I think is most likely "no". > That is a really good point. Times have changed and I guess ODP is on most (all?) of the new Infiniband products now, and maybe we don't need to worry so much about providing first-class support for non-ODP setups. I've got to drag my brain into 2021+! :) thanks, -- John Hubbard NVIDIA
On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote: > The only real option where you could do P2P with buffer pinning are those > compute boards where we know that everything is always accessible to > everybody and we will never need to migrate anything. But even then you want > some mechanism like cgroups to take care of limiting this. Otherwise any > runaway process can bring down your whole system. Why? It is not the pin that is the problem, it was allocating GPU dedicated memory in the first place. pinning it just changes the sequence to free it. No different than CPU memory. > Key question at least for me as GPU maintainer is if we are going to see > modern compute boards together with old non-ODP setups. You should stop thinking about it as 'old non-ODP setups'. ODP is very much a special case that allows a narrow set of functionality to work in a narrow situation. It has a high performance penalty and narrow HW support. So yes, compute boards are routinely used in scenarios where ODP is not available, today and for the foreseeable future. Jason
Am 25.08.21 um 14:18 schrieb Jason Gunthorpe: > On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote: > >> The only real option where you could do P2P with buffer pinning are those >> compute boards where we know that everything is always accessible to >> everybody and we will never need to migrate anything. But even then you want >> some mechanism like cgroups to take care of limiting this. Otherwise any >> runaway process can bring down your whole system. > > Why? It is not the pin that is the problem, it was allocating GPU > dedicated memory in the first place. pinning it just changes the > sequence to free it. No different than CPU memory. Pinning makes the memory un-evictable. In other words as long as we don't pin anything we can support as many processes as we want until we run out of swap space. Swapping sucks badly because your applications become pretty much unuseable, but you can easily recover from it by killing some process. With pinning on the other hand somebody sooner or later receives an -ENOMEM or -ENOSPC and there is no guarantee that this goes to the right process. >> Key question at least for me as GPU maintainer is if we are going to see >> modern compute boards together with old non-ODP setups. > You should stop thinking about it as 'old non-ODP setups'. ODP is > very much a special case that allows a narrow set of functionality to > work in a narrow situation. It has a high performance penalty and > narrow HW support. > > So yes, compute boards are routinely used in scenarios where ODP is > not available, today and for the foreseeable future. Yeah, that's the stuff I'm not sure about. Christian. > > Jason
On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote: > Am 25.08.21 um 14:18 schrieb Jason Gunthorpe: > > On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote: > > > > > The only real option where you could do P2P with buffer pinning are those > > > compute boards where we know that everything is always accessible to > > > everybody and we will never need to migrate anything. But even then you want > > > some mechanism like cgroups to take care of limiting this. Otherwise any > > > runaway process can bring down your whole system. > > Why? It is not the pin that is the problem, it was allocating GPU > > dedicated memory in the first place. pinning it just changes the > > sequence to free it. No different than CPU memory. > > Pinning makes the memory un-evictable. > > In other words as long as we don't pin anything we can support as many > processes as we want until we run out of swap space. Swapping sucks badly > because your applications become pretty much unuseable, but you can easily > recover from it by killing some process. > > With pinning on the other hand somebody sooner or later receives an -ENOMEM > or -ENOSPC and there is no guarantee that this goes to the right process. It is not really different - you have the same failure mode once the system runs out of swap. This is really the kernel side trying to push a policy to the user side that the user side doesn't want.. Dedicated systems are a significant use case here and should be supported, even if the same solution wouldn't be applicable to someone running a desktop. Jason
Am 25.08.21 um 14:38 schrieb Jason Gunthorpe: > On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote: >> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe: >>> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote: >>> >>>> The only real option where you could do P2P with buffer pinning are those >>>> compute boards where we know that everything is always accessible to >>>> everybody and we will never need to migrate anything. But even then you want >>>> some mechanism like cgroups to take care of limiting this. Otherwise any >>>> runaway process can bring down your whole system. >>> Why? It is not the pin that is the problem, it was allocating GPU >>> dedicated memory in the first place. pinning it just changes the >>> sequence to free it. No different than CPU memory. >> Pinning makes the memory un-evictable. >> >> In other words as long as we don't pin anything we can support as many >> processes as we want until we run out of swap space. Swapping sucks badly >> because your applications become pretty much unuseable, but you can easily >> recover from it by killing some process. >> >> With pinning on the other hand somebody sooner or later receives an -ENOMEM >> or -ENOSPC and there is no guarantee that this goes to the right process. > It is not really different - you have the same failure mode once the > system runs out of swap. > > This is really the kernel side trying to push a policy to the user > side that the user side doesn't want.. But which is still the right thing to do as far as I can see. See userspace also doesn't want proper process isolation since it takes extra time. Kernel development is driven by exposing the hardware functionality in a save and manageable manner to userspace, and not by fulfilling userspace requirements. This is very important cause you otherwise you create a specialized system and not a general purpose kernel. > Dedicated systems are a significant use case here and should be > supported, even if the same solution wouldn't be applicable to someone > running a desktop. And exactly that approach is not acceptable. Christian. > > Jason
On Wed, Aug 25, 2021 at 03:51:14PM +0200, Christian König wrote: > Am 25.08.21 um 14:38 schrieb Jason Gunthorpe: > > On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote: > > > Am 25.08.21 um 14:18 schrieb Jason Gunthorpe: > > > > On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote: > > > > > > > > > The only real option where you could do P2P with buffer pinning are those > > > > > compute boards where we know that everything is always accessible to > > > > > everybody and we will never need to migrate anything. But even then you want > > > > > some mechanism like cgroups to take care of limiting this. Otherwise any > > > > > runaway process can bring down your whole system. > > > > Why? It is not the pin that is the problem, it was allocating GPU > > > > dedicated memory in the first place. pinning it just changes the > > > > sequence to free it. No different than CPU memory. > > > Pinning makes the memory un-evictable. > > > > > > In other words as long as we don't pin anything we can support as many > > > processes as we want until we run out of swap space. Swapping sucks badly > > > because your applications become pretty much unuseable, but you can easily > > > recover from it by killing some process. > > > > > > With pinning on the other hand somebody sooner or later receives an -ENOMEM > > > or -ENOSPC and there is no guarantee that this goes to the right process. > > It is not really different - you have the same failure mode once the > > system runs out of swap. > > > > This is really the kernel side trying to push a policy to the user > > side that the user side doesn't want.. > > But which is still the right thing to do as far as I can see. See userspace > also doesn't want proper process isolation since it takes extra time. Why? You are pushing a policy of resource allocation/usage which more properly belongs in userspace. > Kernel development is driven by exposing the hardware functionality in a > save and manageable manner to userspace, and not by fulfilling userspace > requirements. I don't agree with this, that is a 1980's view of OS design. So much these days in the kernel is driven entirely by boutique userspace requirements and is very much not about the classical abstract role of an OS. > > Dedicated systems are a significant use case here and should be > > supported, even if the same solution wouldn't be applicable to someone > > running a desktop. > > And exactly that approach is not acceptable. We have knobs and settings all over the place to allow Linux to support a broad set of use cases from Android to servers, to HPC. So long as they can co-exist and the various optional modes do not critically undermine the security of the kernel, it is well in line with how things have been evolving in the last 15 years. Here you are talking entirely about policy to control memory allocation, which is already well trodden ground for CPU memory. There are now endless boutique ways to deal with this, it is a very narrow view to say that GPU memory is so special and different that only one way can be the correct/allowed way. Jason
Am 25.08.21 um 16:47 schrieb Jason Gunthorpe:
> On Wed, Aug 25, 2021 at 03:51:14PM +0200, Christian König wrote:
>> Am 25.08.21 um 14:38 schrieb Jason Gunthorpe:
>>> On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
>>>> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
>>>>> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
>>>>>
>>>>>> The only real option where you could do P2P with buffer pinning are those
>>>>>> compute boards where we know that everything is always accessible to
>>>>>> everybody and we will never need to migrate anything. But even then you want
>>>>>> some mechanism like cgroups to take care of limiting this. Otherwise any
>>>>>> runaway process can bring down your whole system.
>>>>> Why? It is not the pin that is the problem, it was allocating GPU
>>>>> dedicated memory in the first place. Pinning it just changes the
>>>>> sequence to free it. No different than CPU memory.
>>>> Pinning makes the memory un-evictable.
>>>>
>>>> In other words as long as we don't pin anything we can support as many
>>>> processes as we want until we run out of swap space. Swapping sucks badly
>>>> because your applications become pretty much unusable, but you can easily
>>>> recover from it by killing some process.
>>>>
>>>> With pinning on the other hand somebody sooner or later receives an -ENOMEM
>>>> or -ENOSPC and there is no guarantee that this goes to the right process.
>>> It is not really different - you have the same failure mode once the
>>> system runs out of swap.
>>>
>>> This is really the kernel side trying to push a policy to the user
>>> side that the user side doesn't want..
>> But which is still the right thing to do as far as I can see. See userspace
>> also doesn't want proper process isolation since it takes extra time.
> Why? You are pushing a policy of resource allocation/usage which
> more properly belongs in userspace.
>
>> Kernel development is driven by exposing the hardware functionality in a
>> safe and manageable manner to userspace, and not by fulfilling userspace
>> requirements.
> I don't agree with this, that is a 1980s view of OS design. So much
> these days in the kernel is driven entirely by boutique userspace
> requirements and is very much not about the classical abstract role of
> an OS.

But it's still true nevertheless. Otherwise you would have libraries
for filesystem accesses and no system security to speak of.

>>> Dedicated systems are a significant use case here and should be
>>> supported, even if the same solution wouldn't be applicable to someone
>>> running a desktop.
>> And exactly that approach is not acceptable.
> We have knobs and settings all over the place to allow Linux to
> support a broad set of use cases from Android to servers, to HPC. So
> long as they can co-exist and the various optional modes do not
> critically undermine the security of the kernel, it is well in line
> with how things have been evolving in the last 15 years.

Yeah, that's exactly what I'm talking about by adding cgroup or similar.
You need a knob to control this.

> Here you are talking entirely about policy to control memory
> allocation, which is already well trodden ground for CPU memory.
>
> There are now endless boutique ways to deal with this, it is a very
> narrow view to say that GPU memory is so special and different that
> only one way can be the correct/allowed way.

Well I'm not talking about GPU memory in particular here. This is
mandatory for any memory or, more generally, any resource.

E.g. you are not allowed to pin large amounts of system memory on a
default installation for exactly those reasons as well.

That you can have a knob to disable this behavior for your HPC system is
perfectly fine, but I think, as Dave notes here as well, this is most
likely not the default behavior we want.

Christian.

>
> Jason
On Wed, Aug 25, 2021 at 05:14:06PM +0200, Christian König wrote:

> Yeah, that's exactly what I'm talking about by adding cgroup or similar. You
> need a knob to control this.

We have the pinned memory ulimit today. A pinned memory cgroup might
be interesting, but even containers are covered under the ulimit
(IIRC), so the driver to do this work might not be so strong.

Jason
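As a side note, the "pinned memory ulimit" referred to here is RLIMIT_MEMLOCK.
A minimal sketch of the accounting pattern drivers commonly use before pinning
pages is below; the struct and field names are made up for illustration and do
not come from this thread or any particular driver.

    #include <linux/atomic.h>
    #include <linux/capability.h>
    #include <linux/mm.h>
    #include <linux/sched/signal.h>

    /* Illustrative per-user/per-context bookkeeping. */
    struct my_ctx {
    	atomic64_t pinned_pages;	/* pages currently pinned by this context */
    };

    static int charge_pinned_pages(struct my_ctx *ctx, unsigned long npages)
    {
    	unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    	unsigned long pinned;

    	/* Charge first, then check the limit; undo the charge on failure. */
    	pinned = atomic64_add_return(npages, &ctx->pinned_pages);
    	if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
    		atomic64_sub(npages, &ctx->pinned_pages);
    		return -ENOMEM;
    	}
    	return 0;
    }

The corresponding uncharge on unpin would simply do
atomic64_sub(npages, &ctx->pinned_pages).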
On Wed, Aug 25, 2021 at 6:14 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 25.08.21 um 16:47 schrieb Jason Gunthorpe:
> > On Wed, Aug 25, 2021 at 03:51:14PM +0200, Christian König wrote:
> >> Am 25.08.21 um 14:38 schrieb Jason Gunthorpe:
> >>> On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
> >>>> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
> >>>>> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
> >>>>>
> >>>>>> The only real option where you could do P2P with buffer pinning are those
> >>>>>> compute boards where we know that everything is always accessible to
> >>>>>> everybody and we will never need to migrate anything. But even then you want
> >>>>>> some mechanism like cgroups to take care of limiting this. Otherwise any
> >>>>>> runaway process can bring down your whole system.
> >>>>> Why? It is not the pin that is the problem, it was allocating GPU
> >>>>> dedicated memory in the first place. Pinning it just changes the
> >>>>> sequence to free it. No different than CPU memory.
> >>>> Pinning makes the memory un-evictable.
> >>>>
> >>>> In other words as long as we don't pin anything we can support as many
> >>>> processes as we want until we run out of swap space. Swapping sucks badly
> >>>> because your applications become pretty much unusable, but you can easily
> >>>> recover from it by killing some process.
> >>>>
> >>>> With pinning on the other hand somebody sooner or later receives an -ENOMEM
> >>>> or -ENOSPC and there is no guarantee that this goes to the right process.
> >>> It is not really different - you have the same failure mode once the
> >>> system runs out of swap.
> >>>
> >>> This is really the kernel side trying to push a policy to the user
> >>> side that the user side doesn't want..
> >> But which is still the right thing to do as far as I can see. See userspace
> >> also doesn't want proper process isolation since it takes extra time.
> > Why? You are pushing a policy of resource allocation/usage which
> > more properly belongs in userspace.
> >
> >> Kernel development is driven by exposing the hardware functionality in a
> >> safe and manageable manner to userspace, and not by fulfilling userspace
> >> requirements.
> > I don't agree with this, that is a 1980s view of OS design. So much
> > these days in the kernel is driven entirely by boutique userspace
> > requirements and is very much not about the classical abstract role of
> > an OS.
>
> But it's still true nevertheless. Otherwise you would have libraries
> for filesystem accesses and no system security to speak of.
>
> >>> Dedicated systems are a significant use case here and should be
> >>> supported, even if the same solution wouldn't be applicable to someone
> >>> running a desktop.
> >> And exactly that approach is not acceptable.
> > We have knobs and settings all over the place to allow Linux to
> > support a broad set of use cases from Android to servers, to HPC. So
> > long as they can co-exist and the various optional modes do not
> > critically undermine the security of the kernel, it is well in line
> > with how things have been evolving in the last 15 years.
>
> Yeah, that's exactly what I'm talking about by adding cgroup or similar.
> You need a knob to control this.
>
> > Here you are talking entirely about policy to control memory
> > allocation, which is already well trodden ground for CPU memory.
> >
> > There are now endless boutique ways to deal with this, it is a very
> > narrow view to say that GPU memory is so special and different that
> > only one way can be the correct/allowed way.
>
> Well I'm not talking about GPU memory in particular here. This is
> mandatory for any memory or, more generally, any resource.
>
> E.g. you are not allowed to pin large amounts of system memory on a
> default installation for exactly those reasons as well.

Unless you are working on a h/w architecture that is designed
specifically for a single user. At least in our h/w architecture, we
aim for 100% utilization of the h/w when you are running DL training
workloads, as opposed to what is happening inside a GPU when you are
running training.

Therefore, it is counter-productive to serve multiple users from a
performance perspective. In fact, the training problem is so large
that a single ASIC is hardly sufficient, and you need to run on
anywhere between 8 ASICs to 64-128 ASICs to get a reasonable time for
training.

So there is no point in serving multiple users in that case, and
that's why our device memory (HBM) is always pinned. This is totally
different from the usual role of a GPU, where you serve the desktop
and multiple other applications that draw on the screen.

I wouldn't want some knob in the kernel to control how much memory I
can pin. I just don't think it is useful and/or user friendly.

btw, regarding ODP, we don't support it in h/w (yet) because 99% of
the use-cases in DL training will suffer from performance issues if we
don't pin the host memory. There are extreme topologies, with 10TB of
parameters, that may require this, but like I said, it's very rare and
not worth the effort.

Thanks,
Oded

> That you can have a knob to disable this behavior for your HPC system is
> perfectly fine, but I think, as Dave notes here as well, this is most
> likely not the default behavior we want.
>
> Christian.
>
> > Jason
>
On 24/08/2021 20:32, Jason Gunthorpe wrote:
> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>> On 8/24/21 2:32 AM, Christian König wrote:
>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>> ...
>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>> - Dynamic with revoke_notify
>>>>>> - Static
>>>>>>
>>>>>> Which changes do we need to make the third one work?
>>>>> Basically none at all in the framework.
>>>>>
>>>>> You just need to properly use the dma_buf_pin() function when you start using a
>>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>>>>> after you are done with the DMA-buf.
>>>> I replied to your previous mail, but I'll ask again.
>>>> Doesn't the pin operation migrate the memory to host memory?
>>>
>>> Sorry missed your previous reply.
>>>
>>> And yes at least for the amdgpu driver we migrate the memory to host
>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>> do something similar.
>>
>> Well...for many topologies, migrating to host memory will result in a
>> dramatically slower p2p setup. For that reason, some GPU drivers may
>> want to allow pinning of video memory in some situations.
>>
>> Ideally, you've got modern ODP devices and you don't even need to pin.
>> But if not, and you still hope to do high performance p2p between a GPU
>> and a non-ODP Infiniband device, then you would need to leave the pinned
>> memory in vidmem.
>>
>> So I think we don't want to rule out that behavior, right? Or is the
>> thinking more like, "you're lucky that this old non-ODP setup works at
>> all, and we'll make it work by routing through host/cpu memory, but it
>> will be slow"?
>
> I think it depends on the user, if the user creates memory which is
> permanently located on the GPU then it should be pinnable in this way
> without force migration. But if the memory is inherently migratable
> then it just cannot be pinned in the GPU at all as we can't
> indefinitely block migration from happening, e.g. if the CPU touches it
> later or something.

So are we OK with exporters implementing dma_buf_pin() without migrating
the memory?

If so, do we still want a move_notify callback for non-dynamic importers?
A noop?
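For context, the exporter-side question above amounts to whether a pin
callback may simply account the pin and leave the pages where they are. A
rough sketch of such an exporter is below, assuming a device whose buffers
are permanently resident in device memory; struct exp_buf and its fields are
invented for illustration and are not habanalabs or amdgpu code.

    #include <linux/atomic.h>
    #include <linux/dma-buf.h>

    /* Illustrative only: a buffer that always lives in device memory. */
    struct exp_buf {
    	atomic_t pin_count;
    	/* device address, size, ... */
    };

    static int exp_pin(struct dma_buf_attachment *attach)
    {
    	struct exp_buf *buf = attach->dmabuf->priv;

    	/* Nothing is migrated; the memory never moves. Just track the pin. */
    	atomic_inc(&buf->pin_count);
    	return 0;
    }

    static void exp_unpin(struct dma_buf_attachment *attach)
    {
    	struct exp_buf *buf = attach->dmabuf->priv;

    	atomic_dec(&buf->pin_count);
    }

    static const struct dma_buf_ops exp_dmabuf_ops = {
    	.pin		= exp_pin,
    	.unpin		= exp_unpin,
    	/* .attach, .map_dma_buf, .unmap_dma_buf, .release, ... omitted */
    };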
Am 01.09.21 um 13:20 schrieb Gal Pressman:
> On 24/08/2021 20:32, Jason Gunthorpe wrote:
>> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>>> On 8/24/21 2:32 AM, Christian König wrote:
>>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>> ...
>>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>>> - Dynamic with revoke_notify
>>>>>>> - Static
>>>>>>>
>>>>>>> Which changes do we need to make the third one work?
>>>>>> Basically none at all in the framework.
>>>>>>
>>>>>> You just need to properly use the dma_buf_pin() function when you start using a
>>>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>>>>>> after you are done with the DMA-buf.
>>>>> I replied to your previous mail, but I'll ask again.
>>>>> Doesn't the pin operation migrate the memory to host memory?
>>>> Sorry missed your previous reply.
>>>>
>>>> And yes at least for the amdgpu driver we migrate the memory to host
>>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>>> do something similar.
>>> Well...for many topologies, migrating to host memory will result in a
>>> dramatically slower p2p setup. For that reason, some GPU drivers may
>>> want to allow pinning of video memory in some situations.
>>>
>>> Ideally, you've got modern ODP devices and you don't even need to pin.
>>> But if not, and you still hope to do high performance p2p between a GPU
>>> and a non-ODP Infiniband device, then you would need to leave the pinned
>>> memory in vidmem.
>>>
>>> So I think we don't want to rule out that behavior, right? Or is the
>>> thinking more like, "you're lucky that this old non-ODP setup works at
>>> all, and we'll make it work by routing through host/cpu memory, but it
>>> will be slow"?
>> I think it depends on the user, if the user creates memory which is
>> permanently located on the GPU then it should be pinnable in this way
>> without force migration. But if the memory is inherently migratable
>> then it just cannot be pinned in the GPU at all as we can't
>> indefinitely block migration from happening, e.g. if the CPU touches it
>> later or something.
> So are we OK with exporters implementing dma_buf_pin() without migrating
> the memory?

I think so, yes.

> If so, do we still want a move_notify callback for non-dynamic importers?
> A noop?

Well we could make the move_notify callback optional, e.g. so that you get
the new locking approach but still pin the buffers manually with
dma_buf_pin().

Regards,
Christian.
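To make the suggestion concrete, here is a rough sketch of what the importer
side could look like if move_notify became optional as proposed. This assumes
the core is relaxed to accept an importer_ops without move_notify (the subject
of this thread), so it does not work against the current tree; error handling
is trimmed and the function name is invented.

    #include <linux/dma-buf.h>
    #include <linux/dma-mapping.h>
    #include <linux/dma-resv.h>

    /* No move_notify: this importer never expects the buffer to move. */
    static const struct dma_buf_attach_ops static_importer_ops = {
    	.allow_peer2peer = true,
    };

    static struct sg_table *import_pinned(struct dma_buf *dmabuf,
    					  struct device *dev)
    {
    	struct dma_buf_attachment *attach;
    	struct sg_table *sgt;
    	int ret;

    	attach = dma_buf_dynamic_attach(dmabuf, dev, &static_importer_ops, NULL);
    	if (IS_ERR(attach))
    		return ERR_CAST(attach);

    	dma_resv_lock(dmabuf->resv, NULL);
    	ret = dma_buf_pin(attach);	/* exporter keeps the pages where they are */
    	if (ret) {
    		sgt = ERR_PTR(ret);
    		goto unlock;
    	}

    	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
    	if (IS_ERR(sgt))
    		dma_buf_unpin(attach);
    unlock:
    	dma_resv_unlock(dmabuf->resv);
    	if (IS_ERR(sgt))
    		dma_buf_detach(dmabuf, attach);
    	return sgt;
    }

The teardown path would be the mirror image: map is dropped, the buffer is
unpinned under the reservation lock, then the attachment is detached.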
On 01/09/2021 14:24, Christian König wrote:
>
> Am 01.09.21 um 13:20 schrieb Gal Pressman:
>> On 24/08/2021 20:32, Jason Gunthorpe wrote:
>>> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>>>> On 8/24/21 2:32 AM, Christian König wrote:
>>>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>>> ...
>>>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>>>> - Dynamic with revoke_notify
>>>>>>>> - Static
>>>>>>>>
>>>>>>>> Which changes do we need to make the third one work?
>>>>>>> Basically none at all in the framework.
>>>>>>>
>>>>>>> You just need to properly use the dma_buf_pin() function when you start
>>>>>>> using a buffer (e.g. before you create an attachment) and the
>>>>>>> dma_buf_unpin() function after you are done with the DMA-buf.
>>>>>> I replied to your previous mail, but I'll ask again.
>>>>>> Doesn't the pin operation migrate the memory to host memory?
>>>>> Sorry missed your previous reply.
>>>>>
>>>>> And yes at least for the amdgpu driver we migrate the memory to host
>>>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>>>> do something similar.
>>>> Well...for many topologies, migrating to host memory will result in a
>>>> dramatically slower p2p setup. For that reason, some GPU drivers may
>>>> want to allow pinning of video memory in some situations.
>>>>
>>>> Ideally, you've got modern ODP devices and you don't even need to pin.
>>>> But if not, and you still hope to do high performance p2p between a GPU
>>>> and a non-ODP Infiniband device, then you would need to leave the pinned
>>>> memory in vidmem.
>>>>
>>>> So I think we don't want to rule out that behavior, right? Or is the
>>>> thinking more like, "you're lucky that this old non-ODP setup works at
>>>> all, and we'll make it work by routing through host/cpu memory, but it
>>>> will be slow"?
>>> I think it depends on the user, if the user creates memory which is
>>> permanently located on the GPU then it should be pinnable in this way
>>> without force migration. But if the memory is inherently migratable
>>> then it just cannot be pinned in the GPU at all as we can't
>>> indefinitely block migration from happening, e.g. if the CPU touches it
>>> later or something.
>> So are we OK with exporters implementing dma_buf_pin() without migrating the
>> memory?
>
> I think so, yes.
>
>> If so, do we still want a move_notify callback for non-dynamic importers? A noop?
>
> Well we could make the move_notify callback optional, e.g. so that you get the
> new locking approach but still pin the buffers manually with dma_buf_pin().

Thanks Christian! So the end result will look similar to the original patch I
posted, where peer2peer can be enabled without providing move_notify, correct?
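For completeness, once the kernel pieces fall into place, the userspace side
of "peer2peer access through libibverbs" is the existing dma-buf MR
registration verb. A minimal sketch is below; the dmabuf_fd is assumed to
come from the exporting driver's own interface, which is outside the scope of
this thread, and the wrapper function is invented for illustration.

    #include <infiniband/verbs.h>

    /* Register an exported device-memory dma-buf with an RDMA device. */
    static struct ibv_mr *reg_peer_mem(struct ibv_pd *pd, int dmabuf_fd,
    				       size_t length)
    {
    	return ibv_reg_dmabuf_mr(pd, 0 /* offset */, length, 0 /* iova */,
    				 dmabuf_fd,
    				 IBV_ACCESS_LOCAL_WRITE |
    				 IBV_ACCESS_REMOTE_READ |
    				 IBV_ACCESS_REMOTE_WRITE);
    }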