mbox series

[v2,0/2] NVMe passthrough: Take into account host IOVA reserved regions

Message ID 20200929085550.30926-1-eric.auger@redhat.com
Headers show
Series NVMe passthrough: Take into account host IOVA reserved regions | expand

Message

Eric Auger Sept. 29, 2020, 8:55 a.m. UTC
The current IOVA allocator allocates within the [0x10000, 1ULL << 39]
window, without paying attention to the host IOVA reserved regions.
This prevents NVMe passthtrough from working on ARM as the fixed
IOVAs rapidly grow up to the MSI reserved region [0x8000000, 0x8100000]
causing some VFIO MAP DMA failures. This series collects the usable
IOVA regions using VFIO GET_INFO (this requires the host to support
VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) and rework the fixed and
temporary IOVA allocators to avoid those latter.

For the time being we do not change the arbitrary min/max IOVAs.
In theory they could be dynamically determined but the kernel
currently fails to expose some HW limitations described in the ACPI
tables (such as PCI root complex Device Memory Address Size Limit).
See kernel thread related to "[RFC 0/3] iommu: Reserved regions for
IOVAs beyond dma_mask and iommu aperture" for more details:
https://lkml.org/lkml/2020/9/28/1102

Best Regards

Eric

This series can be found at:
https://github.com/eauger/qemu/tree/nvme_resv_v2

This was tested on ARM only.

History:
v1 -> v2:
- remove "util/vfio-helpers: Dynamically compute the min/max IOVA" to
  relax the kernel dependency
- Fix cabapbility enumeration loop
- set s->usable_iova_ranges=NULL to avoid double free
- handle possible u64 wrap

Eric Auger (2):
  util/vfio-helpers: Collect IOVA reserved regions
  util/vfio-helpers: Rework the IOVA allocator to avoid IOVA reserved
    regions

 util/vfio-helpers.c | 129 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 123 insertions(+), 6 deletions(-)

Comments

Stefan Hajnoczi Sept. 29, 2020, 3:59 p.m. UTC | #1
On Tue, Sep 29, 2020 at 10:55:50AM +0200, Eric Auger wrote:
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index ba0ee6e21c..71145970f3 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -667,6 +667,50 @@ static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
>      return true;
>  }
>  
> +static int
> +qemu_vfio_find_fixed_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
> +{
> +    int i;
> +
> +    for (i = 0; i < s->nb_iova_ranges; i++) {
> +        if (s->usable_iova_ranges[i].end < s->low_water_mark) {
> +            continue;
> +        }
> +        s->low_water_mark =
> +            MAX(s->low_water_mark, s->usable_iova_ranges[i].start);
> +
> +        if (s->usable_iova_ranges[i].end - s->low_water_mark + 1 >= size ||
> +            s->usable_iova_ranges[i].end - s->low_water_mark + 1 == 0) {

I don't understand the == 0 case. It seems like we are allocating an
IOVA beyond usable_iova_ranges[i].end?
Eric Auger Sept. 29, 2020, 7:44 p.m. UTC | #2
Hi Stefan,

On 9/29/20 5:59 PM, Stefan Hajnoczi wrote:
> On Tue, Sep 29, 2020 at 10:55:50AM +0200, Eric Auger wrote:
>> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
>> index ba0ee6e21c..71145970f3 100644
>> --- a/util/vfio-helpers.c
>> +++ b/util/vfio-helpers.c
>> @@ -667,6 +667,50 @@ static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
>>      return true;
>>  }
>>  
>> +static int
>> +qemu_vfio_find_fixed_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < s->nb_iova_ranges; i++) {
>> +        if (s->usable_iova_ranges[i].end < s->low_water_mark) {
>> +            continue;
>> +        }
>> +        s->low_water_mark =
>> +            MAX(s->low_water_mark, s->usable_iova_ranges[i].start);
>> +
>> +        if (s->usable_iova_ranges[i].end - s->low_water_mark + 1 >= size ||
>> +            s->usable_iova_ranges[i].end - s->low_water_mark + 1 == 0) {
> 
> I don't understand the == 0 case. It seems like we are allocating an
> IOVA beyond usable_iova_ranges[i].end?>
It is meant to handle the case were low_water_mark = 0 and
s->usable_iova_ranges[0].end = ULLONG_MAX (I know it cannot exist at the
moment but may happen in the future) where we get an overflow. Given the
        if (s->usable_iova_ranges[i].end < s->low_water_mark) {
            continue;
        }
I think this prevents us from allocating beyond
usable_iova_ranges[i].end or do I miss something?

Thanks

Eric
Stefan Hajnoczi Sept. 30, 2020, 9:18 a.m. UTC | #3
On Tue, Sep 29, 2020 at 09:44:48PM +0200, Auger Eric wrote:
> Hi Stefan,
> 
> On 9/29/20 5:59 PM, Stefan Hajnoczi wrote:
> > On Tue, Sep 29, 2020 at 10:55:50AM +0200, Eric Auger wrote:
> >> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> >> index ba0ee6e21c..71145970f3 100644
> >> --- a/util/vfio-helpers.c
> >> +++ b/util/vfio-helpers.c
> >> @@ -667,6 +667,50 @@ static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
> >>      return true;
> >>  }
> >>  
> >> +static int
> >> +qemu_vfio_find_fixed_iova(QEMUVFIOState *s, size_t size, uint64_t *iova)
> >> +{
> >> +    int i;
> >> +
> >> +    for (i = 0; i < s->nb_iova_ranges; i++) {
> >> +        if (s->usable_iova_ranges[i].end < s->low_water_mark) {
> >> +            continue;
> >> +        }
> >> +        s->low_water_mark =
> >> +            MAX(s->low_water_mark, s->usable_iova_ranges[i].start);
> >> +
> >> +        if (s->usable_iova_ranges[i].end - s->low_water_mark + 1 >= size ||
> >> +            s->usable_iova_ranges[i].end - s->low_water_mark + 1 == 0) {
> > 
> > I don't understand the == 0 case. It seems like we are allocating an
> > IOVA beyond usable_iova_ranges[i].end?>
> It is meant to handle the case were low_water_mark = 0 and
> s->usable_iova_ranges[0].end = ULLONG_MAX (I know it cannot exist at the
> moment but may happen in the future) where we get an overflow. Given the
>         if (s->usable_iova_ranges[i].end < s->low_water_mark) {
>             continue;
>         }
> I think this prevents us from allocating beyond
> usable_iova_ranges[i].end or do I miss something?

Yes, you are right. Here are the constraints:

e >= l
j = max(l, s)
e - j + 1 < s
e - j + 1 == 0

Assume l >= s so we can replace j with l:

e >= l
e - l + 1 < s
e - l + 1 == 0

The case I'm worried about is when the iova range cannot fit s bytes.
The last condition is only true when e = l - 1, but this violates the
first condition e >= l.

So the problem scenario cannot occur.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Stefan Hajnoczi Sept. 30, 2020, 9:23 a.m. UTC | #4
On Tue, Sep 29, 2020 at 10:55:48AM +0200, Eric Auger wrote:
> The current IOVA allocator allocates within the [0x10000, 1ULL << 39]

> window, without paying attention to the host IOVA reserved regions.

> This prevents NVMe passthtrough from working on ARM as the fixed

> IOVAs rapidly grow up to the MSI reserved region [0x8000000, 0x8100000]

> causing some VFIO MAP DMA failures. This series collects the usable

> IOVA regions using VFIO GET_INFO (this requires the host to support

> VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) and rework the fixed and

> temporary IOVA allocators to avoid those latter.

> 

> For the time being we do not change the arbitrary min/max IOVAs.

> In theory they could be dynamically determined but the kernel

> currently fails to expose some HW limitations described in the ACPI

> tables (such as PCI root complex Device Memory Address Size Limit).

> See kernel thread related to "[RFC 0/3] iommu: Reserved regions for

> IOVAs beyond dma_mask and iommu aperture" for more details:

> https://lkml.org/lkml/2020/9/28/1102

> 

> Best Regards

> 

> Eric

> 

> This series can be found at:

> https://github.com/eauger/qemu/tree/nvme_resv_v2

> 

> This was tested on ARM only.

> 

> History:

> v1 -> v2:

> - remove "util/vfio-helpers: Dynamically compute the min/max IOVA" to

>   relax the kernel dependency

> - Fix cabapbility enumeration loop

> - set s->usable_iova_ranges=NULL to avoid double free

> - handle possible u64 wrap

> 

> Eric Auger (2):

>   util/vfio-helpers: Collect IOVA reserved regions

>   util/vfio-helpers: Rework the IOVA allocator to avoid IOVA reserved

>     regions

> 

>  util/vfio-helpers.c | 129 +++++++++++++++++++++++++++++++++++++++++---

>  1 file changed, 123 insertions(+), 6 deletions(-)

> 

> -- 

> 2.21.3

> 


Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan