virtiofsd: Use --thread-pool-size=0 to mean no thread pool

Message ID	20201105194416.GA1384085@redhat.com
State	Accepted
Commit	e49393a349925567cb83cfe810bc595e42a17883
Headers	show Return-Path: <SRS0=LLOU=EL=nongnu.org=qemu-devel-bounces+qemu-devel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 16EE120704 Date: Thu, 5 Nov 2020 14:44:16 -0500 From: Vivek Goyal <vgoyal@redhat.com> To: qemu-devel@nongnu.org, virtio-fs-list <virtio-fs@redhat.com> Subject: [PATCH] virtiofsd: Use --thread-pool-size=0 to mean no thread pool Message-ID: <20201105194416.GA1384085@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Received-SPF: pass client-ip=63.128.21.124; envelope-from=vgoyal@redhat.com; helo=us-smtp-delivery-124.mimecast.com Precedence: list Cc: jose.carlos.venegas.munoz@intel.com, "Dr. David Alan Gilbert" <dgilbert@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com> Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	virtiofsd: Use --thread-pool-size=0 to mean no thread pool \| expand virtiofsd: Use --thread-pool-size=0 to mean no thread pool

Vivek Goyal Nov. 5, 2020, 7:44 p.m. UTC

Right now we create a thread pool and main thread hands over the request
to thread in thread pool to process. Number of threads in thread pool
can be managed by option --thread-pool-size.

There is a chance that in case of some workloads, we might get better
performance if we don't handover the request to a different thread
and process in the context of thread receiving the request.

To implement that, redefine the meaning of --thread-pool-size=0 to
mean that don't use a thread pool. Instead process the request in
the context of thread receiving request from the queue.

I can't think how --thread-pool-size=0 is useful and hence using
that. If it is already useful somehow, I could look at defining
a new option say "--no-thread-pool".

I think this patch will be used more as a debug help to do comparison
when it is more effecient to do not hand over the requests to a
thread pool.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 tools/virtiofsd/fuse_virtio.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

Vivek Goyal Nov. 5, 2020, 7:52 p.m. UTC | #1

On Thu, Nov 05, 2020 at 02:44:16PM -0500, Vivek Goyal wrote:
> Right now we create a thread pool and main thread hands over the request

> to thread in thread pool to process. Number of threads in thread pool

> can be managed by option --thread-pool-size.

> 

> There is a chance that in case of some workloads, we might get better

> performance if we don't handover the request to a different thread

> and process in the context of thread receiving the request.

> 

> To implement that, redefine the meaning of --thread-pool-size=0 to

> mean that don't use a thread pool. Instead process the request in

> the context of thread receiving request from the queue.

> 

> I can't think how --thread-pool-size=0 is useful and hence using

> that. If it is already useful somehow, I could look at defining

> a new option say "--no-thread-pool".

> 

> I think this patch will be used more as a debug help to do comparison

> when it is more effecient to do not hand over the requests to a

> thread pool.


I ran virtiofs-tests to comapre --thread-pool-size=0 and
--thread-pool-size=64. And results seem to be all over the place. In
some cases thread pool seems to perform batter and in other cases
no-thread-pool seems to perform better.

But in general it looks like that except for the case of libaio workload,
no-thread-pool is helping.

Thanks
Vivek

NAME                    WORKLOAD                Bandwidth       IOPS            
thread-pool             seqread-psync           682.4mb         170.6k          
no-thread-pool          seqread-psync           679.3mb         169.8k          

thread-pool             seqread-psync-multi     2415.9mb        603.9k          
no-thread-pool          seqread-psync-multi     2528.5mb        632.1k          

thread-pool             seqread-mmap            591.7mb         147.9k          
no-thread-pool          seqread-mmap            595.6mb         148.9k          

thread-pool             seqread-mmap-multi      2195.3mb        548.8k          
no-thread-pool          seqread-mmap-multi      2286.1mb        571.5k          

thread-pool             seqread-libaio          329.1mb         82.2k           
no-thread-pool          seqread-libaio          271.5mb         67.8k           

thread-pool             seqread-libaio-multi    1387.1mb        346.7k          
no-thread-pool          seqread-libaio-multi    1508.2mb        377.0k          

thread-pool             randread-psync          59.0mb          14.7k           
no-thread-pool          randread-psync          78.5mb          19.6k           

thread-pool             randread-psync-multi    226.4mb         56.6k           
no-thread-pool          randread-psync-multi    289.2mb         72.3k           

thread-pool             randread-mmap           48.4mb          12.1k           
no-thread-pool          randread-mmap           57.4mb          14.3k           

thread-pool             randread-mmap-multi     183.5mb         45.8k           
no-thread-pool          randread-mmap-multi     240.5mb         60.1k           

thread-pool             randread-libaio         329.4mb         82.3k           
no-thread-pool          randread-libaio         261.8mb         65.4k           

thread-pool             randread-libaio-multi   252.1mb         63.0k           
no-thread-pool          randread-libaio-multi   278.3mb         69.5k           

thread-pool             seqwrite-psync          54.9mb          13.7k           
no-thread-pool          seqwrite-psync          77.8mb          19.4k           

thread-pool             seqwrite-psync-multi    219.9mb         54.9k           
no-thread-pool          seqwrite-psync-multi    314.8mb         78.7k           

thread-pool             seqwrite-mmap           191.7mb         47.9k           
no-thread-pool          seqwrite-mmap           201.4mb         50.3k           

thread-pool             seqwrite-mmap-multi     346.6mb         86.6k           
no-thread-pool          seqwrite-mmap-multi     387.6mb         96.9k           

thread-pool             seqwrite-libaio         306.4mb         76.6k           
no-thread-pool          seqwrite-libaio         248.2mb         62.0k           

thread-pool             seqwrite-libaio-multi   328.5mb         82.1k           
no-thread-pool          seqwrite-libaio-multi   305.6mb         76.4k           

thread-pool             randwrite-psync         55.6mb          13.9k           
no-thread-pool          randwrite-psync         81.2mb          20.3k           

thread-pool             randwrite-psync-multi   227.0mb         56.7k           
no-thread-pool          randwrite-psync-multi   311.6mb         77.8k           

thread-pool             randwrite-mmap          35.3mb          9038            
no-thread-pool          randwrite-mmap          58.1mb          14.5k           

thread-pool             randwrite-mmap-multi    140.8mb         35.2k           
no-thread-pool          randwrite-mmap-multi    210.5mb         52.6k           

thread-pool             randwrite-libaio        307.1mb         76.7k           
no-thread-pool          randwrite-libaio        279.4mb         69.8k           

thread-pool             randwrite-libaio-multi  361.3mb         90.3k           
no-thread-pool          randwrite-libaio-multi  302.6mb         75.6k           

thread-pool             randrw-psync            34.1mb/11.4mb   8754/2929       
no-thread-pool          randrw-psync            38.7mb/12.9mb   9919/3321       

thread-pool             randrw-psync-multi      126.2mb/42.3mb  31.5k/10.5k     
no-thread-pool          randrw-psync-multi      172.2mb/57.7mb  43.0k/14.4k     

thread-pool             randrw-mmap             24.1mb/8270kb   6194/2067       
no-thread-pool          randrw-mmap             42.0mb/14.0mb   10.5k/3606      

thread-pool             randrw-mmap-multi       127.9mb/42.8mb  31.9k/10.7k     
no-thread-pool          randrw-mmap-multi       179.4mb/60.0mb  44.8k/15.0k     

thread-pool             randrw-libaio           64.0mb/21.4mb   16.0k/5485      
no-thread-pool          randrw-libaio           79.6mb/26.6mb   19.9k/6831      

thread-pool             randrw-libaio-multi     174.2mb/58.3mb  43.5k/14.5k     
no-thread-pool          randrw-libaio-multi     201.6mb/67.5mb  50.4k/16.8k

Venegas Munoz, Jose Carlos Nov. 6, 2020, 8:33 p.m. UTC | #2

Hi Vivek,

I have tested with Kata 1.12-apha0, the results seems that are better for the use fio config I am tracking. 

The fio config does  randrw:

fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75

- I can see better results with this patch
- 9pfs is still better in the case of Kata because of the use of:
https://github.com/kata-containers/packaging/blob/stable-1.12/qemu/patches/5.0.x/0001-9p-removing-coroutines-of-9p-to-increase-the-I-O-per.patch

Results:
./fio-results-run_virtiofs_tread_pool_0
   READ: bw=42.8MiB/s (44.8MB/s), 42.8MiB/s-42.8MiB/s (44.8MB/s-44.8MB/s), io=150MiB (157MB), run=3507-3507msec
  WRITE: bw=14.3MiB/s (14.9MB/s), 14.3MiB/s-14.3MiB/s (14.9MB/s-14.9MB/s), io=49.0MiB (52.4MB), run=3507-3507msec
./fio-results-run_9pfs
   READ: bw=55.1MiB/s (57.8MB/s), 55.1MiB/s-55.1MiB/s (57.8MB/s-57.8MB/s), io=150MiB (157MB), run=2722-2722msec
  WRITE: bw=18.4MiB/s (19.3MB/s), 18.4MiB/s-18.4MiB/s (19.3MB/s-19.3MB/s), io=49.0MiB (52.4MB), run=2722-2722msec
./fio-results-run_virtiofs_tread_pool_1
   READ: bw=34.5MiB/s (36.1MB/s), 34.5MiB/s-34.5MiB/s (36.1MB/s-36.1MB/s), io=150MiB (157MB), run=4354-4354msec
  WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=49.0MiB (52.4MB), run=4354-4354msec
./fio-results-run_virtiofs_tread_pool_64
   READ: bw=32.3MiB/s (33.8MB/s), 32.3MiB/s-32.3MiB/s (33.8MB/s-33.8MB/s), io=150MiB (157MB), run=4648-4648msec
  WRITE: bw=10.8MiB/s (11.3MB/s), 10.8MiB/s-10.8MiB/s (11.3MB/s-11.3MB/s), io=49.0MiB (52.4MB), run=4648-4648msec

Next: 
- run https://github.com/rhvgoyal/virtiofs-tests for  tread_pool_0, tread_pool_64, and 9pfs
- Test https://lore.kernel.org/linux-fsdevel/20201009181512.65496-1-vgoyal@redhat.com/

All the testing  for kata is based in:
https://github.com/jcvenegas/mrunner/blob/master/scripts/bash_workloads/build-qemu-and-run-fio-test.sh

I ran this using an azure VM.

Cheers,
Carlos


On 05/11/20 13:53, "Vivek Goyal" <vgoyal@redhat.com> wrote:

    On Thu, Nov 05, 2020 at 02:44:16PM -0500, Vivek Goyal wrote:
    > Right now we create a thread pool and main thread hands over the request

    > to thread in thread pool to process. Number of threads in thread pool

    > can be managed by option --thread-pool-size.

    > 

    > There is a chance that in case of some workloads, we might get better

    > performance if we don't handover the request to a different thread

    > and process in the context of thread receiving the request.

    > 

    > To implement that, redefine the meaning of --thread-pool-size=0 to

    > mean that don't use a thread pool. Instead process the request in

    > the context of thread receiving request from the queue.

    > 

    > I can't think how --thread-pool-size=0 is useful and hence using

    > that. If it is already useful somehow, I could look at defining

    > a new option say "--no-thread-pool".

    > 

    > I think this patch will be used more as a debug help to do comparison

    > when it is more effecient to do not hand over the requests to a

    > thread pool.


    I ran virtiofs-tests to comapre --thread-pool-size=0 and
    --thread-pool-size=64. And results seem to be all over the place. In
    some cases thread pool seems to perform batter and in other cases
    no-thread-pool seems to perform better.

    But in general it looks like that except for the case of libaio workload,
    no-thread-pool is helping.

    Thanks
    Vivek

    NAME                    WORKLOAD                Bandwidth       IOPS            
    thread-pool             seqread-psync           682.4mb         170.6k          
    no-thread-pool          seqread-psync           679.3mb         169.8k          

    thread-pool             seqread-psync-multi     2415.9mb        603.9k          
    no-thread-pool          seqread-psync-multi     2528.5mb        632.1k          

    thread-pool             seqread-mmap            591.7mb         147.9k          
    no-thread-pool          seqread-mmap            595.6mb         148.9k          

    thread-pool             seqread-mmap-multi      2195.3mb        548.8k          
    no-thread-pool          seqread-mmap-multi      2286.1mb        571.5k          

    thread-pool             seqread-libaio          329.1mb         82.2k           
    no-thread-pool          seqread-libaio          271.5mb         67.8k           

    thread-pool             seqread-libaio-multi    1387.1mb        346.7k          
    no-thread-pool          seqread-libaio-multi    1508.2mb        377.0k          

    thread-pool             randread-psync          59.0mb          14.7k           
    no-thread-pool          randread-psync          78.5mb          19.6k           

    thread-pool             randread-psync-multi    226.4mb         56.6k           
    no-thread-pool          randread-psync-multi    289.2mb         72.3k           

    thread-pool             randread-mmap           48.4mb          12.1k           
    no-thread-pool          randread-mmap           57.4mb          14.3k           

    thread-pool             randread-mmap-multi     183.5mb         45.8k           
    no-thread-pool          randread-mmap-multi     240.5mb         60.1k           

    thread-pool             randread-libaio         329.4mb         82.3k           
    no-thread-pool          randread-libaio         261.8mb         65.4k           

    thread-pool             randread-libaio-multi   252.1mb         63.0k           
    no-thread-pool          randread-libaio-multi   278.3mb         69.5k           

    thread-pool             seqwrite-psync          54.9mb          13.7k           
    no-thread-pool          seqwrite-psync          77.8mb          19.4k           

    thread-pool             seqwrite-psync-multi    219.9mb         54.9k           
    no-thread-pool          seqwrite-psync-multi    314.8mb         78.7k           

    thread-pool             seqwrite-mmap           191.7mb         47.9k           
    no-thread-pool          seqwrite-mmap           201.4mb         50.3k           

    thread-pool             seqwrite-mmap-multi     346.6mb         86.6k           
    no-thread-pool          seqwrite-mmap-multi     387.6mb         96.9k           

    thread-pool             seqwrite-libaio         306.4mb         76.6k           
    no-thread-pool          seqwrite-libaio         248.2mb         62.0k           

    thread-pool             seqwrite-libaio-multi   328.5mb         82.1k           
    no-thread-pool          seqwrite-libaio-multi   305.6mb         76.4k           

    thread-pool             randwrite-psync         55.6mb          13.9k           
    no-thread-pool          randwrite-psync         81.2mb          20.3k           

    thread-pool             randwrite-psync-multi   227.0mb         56.7k           
    no-thread-pool          randwrite-psync-multi   311.6mb         77.8k           

    thread-pool             randwrite-mmap          35.3mb          9038            
    no-thread-pool          randwrite-mmap          58.1mb          14.5k           

    thread-pool             randwrite-mmap-multi    140.8mb         35.2k           
    no-thread-pool          randwrite-mmap-multi    210.5mb         52.6k           

    thread-pool             randwrite-libaio        307.1mb         76.7k           
    no-thread-pool          randwrite-libaio        279.4mb         69.8k           

    thread-pool             randwrite-libaio-multi  361.3mb         90.3k           
    no-thread-pool          randwrite-libaio-multi  302.6mb         75.6k           

    thread-pool             randrw-psync            34.1mb/11.4mb   8754/2929       
    no-thread-pool          randrw-psync            38.7mb/12.9mb   9919/3321       

    thread-pool             randrw-psync-multi      126.2mb/42.3mb  31.5k/10.5k     
    no-thread-pool          randrw-psync-multi      172.2mb/57.7mb  43.0k/14.4k     

    thread-pool             randrw-mmap             24.1mb/8270kb   6194/2067       
    no-thread-pool          randrw-mmap             42.0mb/14.0mb   10.5k/3606      

    thread-pool             randrw-mmap-multi       127.9mb/42.8mb  31.9k/10.7k     
    no-thread-pool          randrw-mmap-multi       179.4mb/60.0mb  44.8k/15.0k     

    thread-pool             randrw-libaio           64.0mb/21.4mb   16.0k/5485      
    no-thread-pool          randrw-libaio           79.6mb/26.6mb   19.9k/6831      

    thread-pool             randrw-libaio-multi     174.2mb/58.3mb  43.5k/14.5k     
    no-thread-pool          randrw-libaio-multi     201.6mb/67.5mb  50.4k/16.8k

Vivek Goyal Nov. 6, 2020, 10:35 p.m. UTC | #3

On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos wrote:
> Hi Vivek,

> 

> I have tested with Kata 1.12-apha0, the results seems that are better for the use fio config I am tracking. 

> 

> The fio config does  randrw:

> 

> fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75

> 


Hi Carlos,

Thanks for the testing. 

So basically two conclusions from your tests.

- for virtiofs, --thread-pool-size=0 is performing better as comapred
  to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately
  35-40% better.

- virtio-9p is still approximately 30% better than virtiofs
  --thread-pool-size=0.

As I had done the analysis that this particular workload (mixed read and
write) is bad with virtiofs because after every write we are invalidating
attrs and cache so next read ends up fetching attrs again. I had posted
patches to gain some of the performance.

https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.com/

But I got the feedback to look into implementing file leases instead.

Anyway, good to know that --thread-pool-size=0 narrows the performance
gap substantially (more than half) for this workload. I will look into
identifying further bottlenecks.

Given in my tests, most of the workload benefited from
--thread-pool-size=0, may be this can be the default and user can
opt-in for a thread pool (--thread-pool-size=64) once it is known
that a particular workload benefits from it.

David, Stefan, WDYT?

Thanks
Vivek

> - I can see better results with this patch

> - 9pfs is still better in the case of Kata because of the use of:

> https://github.com/kata-containers/packaging/blob/stable-1.12/qemu/patches/5.0.x/0001-9p-removing-coroutines-of-9p-to-increase-the-I-O-per.patch

> 

> Results:

> ./fio-results-run_virtiofs_tread_pool_0

>    READ: bw=42.8MiB/s (44.8MB/s), 42.8MiB/s-42.8MiB/s (44.8MB/s-44.8MB/s), io=150MiB (157MB), run=3507-3507msec

>   WRITE: bw=14.3MiB/s (14.9MB/s), 14.3MiB/s-14.3MiB/s (14.9MB/s-14.9MB/s), io=49.0MiB (52.4MB), run=3507-3507msec

> ./fio-results-run_9pfs

>    READ: bw=55.1MiB/s (57.8MB/s), 55.1MiB/s-55.1MiB/s (57.8MB/s-57.8MB/s), io=150MiB (157MB), run=2722-2722msec

>   WRITE: bw=18.4MiB/s (19.3MB/s), 18.4MiB/s-18.4MiB/s (19.3MB/s-19.3MB/s), io=49.0MiB (52.4MB), run=2722-2722msec

> ./fio-results-run_virtiofs_tread_pool_1

>    READ: bw=34.5MiB/s (36.1MB/s), 34.5MiB/s-34.5MiB/s (36.1MB/s-36.1MB/s), io=150MiB (157MB), run=4354-4354msec

>   WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=49.0MiB (52.4MB), run=4354-4354msec

> ./fio-results-run_virtiofs_tread_pool_64

>    READ: bw=32.3MiB/s (33.8MB/s), 32.3MiB/s-32.3MiB/s (33.8MB/s-33.8MB/s), io=150MiB (157MB), run=4648-4648msec

>   WRITE: bw=10.8MiB/s (11.3MB/s), 10.8MiB/s-10.8MiB/s (11.3MB/s-11.3MB/s), io=49.0MiB (52.4MB), run=4648-4648msec

> 

> Next: 

> - run https://github.com/rhvgoyal/virtiofs-tests for  tread_pool_0, tread_pool_64, and 9pfs

> - Test https://lore.kernel.org/linux-fsdevel/20201009181512.65496-1-vgoyal@redhat.com/

> 

> All the testing  for kata is based in:

> https://github.com/jcvenegas/mrunner/blob/master/scripts/bash_workloads/build-qemu-and-run-fio-test.sh

> 

> I ran this using an azure VM.

> 

> Cheers,

> Carlos

> 

> 

> On 05/11/20 13:53, "Vivek Goyal" <vgoyal@redhat.com> wrote:

> 

>     On Thu, Nov 05, 2020 at 02:44:16PM -0500, Vivek Goyal wrote:

>     > Right now we create a thread pool and main thread hands over the request

>     > to thread in thread pool to process. Number of threads in thread pool

>     > can be managed by option --thread-pool-size.

>     > 

>     > There is a chance that in case of some workloads, we might get better

>     > performance if we don't handover the request to a different thread

>     > and process in the context of thread receiving the request.

>     > 

>     > To implement that, redefine the meaning of --thread-pool-size=0 to

>     > mean that don't use a thread pool. Instead process the request in

>     > the context of thread receiving request from the queue.

>     > 

>     > I can't think how --thread-pool-size=0 is useful and hence using

>     > that. If it is already useful somehow, I could look at defining

>     > a new option say "--no-thread-pool".

>     > 

>     > I think this patch will be used more as a debug help to do comparison

>     > when it is more effecient to do not hand over the requests to a

>     > thread pool.

> 

>     I ran virtiofs-tests to comapre --thread-pool-size=0 and

>     --thread-pool-size=64. And results seem to be all over the place. In

>     some cases thread pool seems to perform batter and in other cases

>     no-thread-pool seems to perform better.

> 

>     But in general it looks like that except for the case of libaio workload,

>     no-thread-pool is helping.

> 

>     Thanks

>     Vivek

> 

>     NAME                    WORKLOAD                Bandwidth       IOPS            

>     thread-pool             seqread-psync           682.4mb         170.6k          

>     no-thread-pool          seqread-psync           679.3mb         169.8k          

> 

>     thread-pool             seqread-psync-multi     2415.9mb        603.9k          

>     no-thread-pool          seqread-psync-multi     2528.5mb        632.1k          

> 

>     thread-pool             seqread-mmap            591.7mb         147.9k          

>     no-thread-pool          seqread-mmap            595.6mb         148.9k          

> 

>     thread-pool             seqread-mmap-multi      2195.3mb        548.8k          

>     no-thread-pool          seqread-mmap-multi      2286.1mb        571.5k          

> 

>     thread-pool             seqread-libaio          329.1mb         82.2k           

>     no-thread-pool          seqread-libaio          271.5mb         67.8k           

> 

>     thread-pool             seqread-libaio-multi    1387.1mb        346.7k          

>     no-thread-pool          seqread-libaio-multi    1508.2mb        377.0k          

> 

>     thread-pool             randread-psync          59.0mb          14.7k           

>     no-thread-pool          randread-psync          78.5mb          19.6k           

> 

>     thread-pool             randread-psync-multi    226.4mb         56.6k           

>     no-thread-pool          randread-psync-multi    289.2mb         72.3k           

> 

>     thread-pool             randread-mmap           48.4mb          12.1k           

>     no-thread-pool          randread-mmap           57.4mb          14.3k           

> 

>     thread-pool             randread-mmap-multi     183.5mb         45.8k           

>     no-thread-pool          randread-mmap-multi     240.5mb         60.1k           

> 

>     thread-pool             randread-libaio         329.4mb         82.3k           

>     no-thread-pool          randread-libaio         261.8mb         65.4k           

> 

>     thread-pool             randread-libaio-multi   252.1mb         63.0k           

>     no-thread-pool          randread-libaio-multi   278.3mb         69.5k           

> 

>     thread-pool             seqwrite-psync          54.9mb          13.7k           

>     no-thread-pool          seqwrite-psync          77.8mb          19.4k           

> 

>     thread-pool             seqwrite-psync-multi    219.9mb         54.9k           

>     no-thread-pool          seqwrite-psync-multi    314.8mb         78.7k           

> 

>     thread-pool             seqwrite-mmap           191.7mb         47.9k           

>     no-thread-pool          seqwrite-mmap           201.4mb         50.3k           

> 

>     thread-pool             seqwrite-mmap-multi     346.6mb         86.6k           

>     no-thread-pool          seqwrite-mmap-multi     387.6mb         96.9k           

> 

>     thread-pool             seqwrite-libaio         306.4mb         76.6k           

>     no-thread-pool          seqwrite-libaio         248.2mb         62.0k           

> 

>     thread-pool             seqwrite-libaio-multi   328.5mb         82.1k           

>     no-thread-pool          seqwrite-libaio-multi   305.6mb         76.4k           

> 

>     thread-pool             randwrite-psync         55.6mb          13.9k           

>     no-thread-pool          randwrite-psync         81.2mb          20.3k           

> 

>     thread-pool             randwrite-psync-multi   227.0mb         56.7k           

>     no-thread-pool          randwrite-psync-multi   311.6mb         77.8k           

> 

>     thread-pool             randwrite-mmap          35.3mb          9038            

>     no-thread-pool          randwrite-mmap          58.1mb          14.5k           

> 

>     thread-pool             randwrite-mmap-multi    140.8mb         35.2k           

>     no-thread-pool          randwrite-mmap-multi    210.5mb         52.6k           

> 

>     thread-pool             randwrite-libaio        307.1mb         76.7k           

>     no-thread-pool          randwrite-libaio        279.4mb         69.8k           

> 

>     thread-pool             randwrite-libaio-multi  361.3mb         90.3k           

>     no-thread-pool          randwrite-libaio-multi  302.6mb         75.6k           

> 

>     thread-pool             randrw-psync            34.1mb/11.4mb   8754/2929       

>     no-thread-pool          randrw-psync            38.7mb/12.9mb   9919/3321       

> 

>     thread-pool             randrw-psync-multi      126.2mb/42.3mb  31.5k/10.5k     

>     no-thread-pool          randrw-psync-multi      172.2mb/57.7mb  43.0k/14.4k     

> 

>     thread-pool             randrw-mmap             24.1mb/8270kb   6194/2067       

>     no-thread-pool          randrw-mmap             42.0mb/14.0mb   10.5k/3606      

> 

>     thread-pool             randrw-mmap-multi       127.9mb/42.8mb  31.9k/10.7k     

>     no-thread-pool          randrw-mmap-multi       179.4mb/60.0mb  44.8k/15.0k     

> 

>     thread-pool             randrw-libaio           64.0mb/21.4mb   16.0k/5485      

>     no-thread-pool          randrw-libaio           79.6mb/26.6mb   19.9k/6831      

> 

>     thread-pool             randrw-libaio-multi     174.2mb/58.3mb  43.5k/14.5k     

>     no-thread-pool          randrw-libaio-multi     201.6mb/67.5mb  50.4k/16.8k     

> 

>

Dr. David Alan Gilbert Nov. 9, 2020, 10:02 a.m. UTC | #4

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos wrote:

> > Hi Vivek,

> > 

> > I have tested with Kata 1.12-apha0, the results seems that are better for the use fio config I am tracking. 

> > 

> > The fio config does  randrw:

> > 

> > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75

> > 

> 

> Hi Carlos,

> 

> Thanks for the testing. 

> 

> So basically two conclusions from your tests.

> 

> - for virtiofs, --thread-pool-size=0 is performing better as comapred

>   to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately

>   35-40% better.

> 

> - virtio-9p is still approximately 30% better than virtiofs

>   --thread-pool-size=0.

> 

> As I had done the analysis that this particular workload (mixed read and

> write) is bad with virtiofs because after every write we are invalidating

> attrs and cache so next read ends up fetching attrs again. I had posted

> patches to gain some of the performance.

> 

> https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.com/

> 

> But I got the feedback to look into implementing file leases instead.

> 

> Anyway, good to know that --thread-pool-size=0 narrows the performance

> gap substantially (more than half) for this workload. I will look into

> identifying further bottlenecks.


Could you try this thread-pool-size=0 together with your attr patches;
does that catch up to 9p (even though I know you need to rework the attr
stuff as leases).

> Given in my tests, most of the workload benefited from

> --thread-pool-size=0, may be this can be the default and user can

> opt-in for a thread pool (--thread-pool-size=64) once it is known

> that a particular workload benefits from it.

> 

> David, Stefan, WDYT?


I think having the option makes sense; the default I'm not sure about;
what confuses me is why it helps - although it's also interesting it
helps 9p as well.

Dave

> 

> Thanks

> Vivek

> 

> > - I can see better results with this patch

> > - 9pfs is still better in the case of Kata because of the use of:

> > https://github.com/kata-containers/packaging/blob/stable-1.12/qemu/patches/5.0.x/0001-9p-removing-coroutines-of-9p-to-increase-the-I-O-per.patch

> > 

> > Results:

> > ./fio-results-run_virtiofs_tread_pool_0

> >    READ: bw=42.8MiB/s (44.8MB/s), 42.8MiB/s-42.8MiB/s (44.8MB/s-44.8MB/s), io=150MiB (157MB), run=3507-3507msec

> >   WRITE: bw=14.3MiB/s (14.9MB/s), 14.3MiB/s-14.3MiB/s (14.9MB/s-14.9MB/s), io=49.0MiB (52.4MB), run=3507-3507msec

> > ./fio-results-run_9pfs

> >    READ: bw=55.1MiB/s (57.8MB/s), 55.1MiB/s-55.1MiB/s (57.8MB/s-57.8MB/s), io=150MiB (157MB), run=2722-2722msec

> >   WRITE: bw=18.4MiB/s (19.3MB/s), 18.4MiB/s-18.4MiB/s (19.3MB/s-19.3MB/s), io=49.0MiB (52.4MB), run=2722-2722msec

> > ./fio-results-run_virtiofs_tread_pool_1

> >    READ: bw=34.5MiB/s (36.1MB/s), 34.5MiB/s-34.5MiB/s (36.1MB/s-36.1MB/s), io=150MiB (157MB), run=4354-4354msec

> >   WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=49.0MiB (52.4MB), run=4354-4354msec

> > ./fio-results-run_virtiofs_tread_pool_64

> >    READ: bw=32.3MiB/s (33.8MB/s), 32.3MiB/s-32.3MiB/s (33.8MB/s-33.8MB/s), io=150MiB (157MB), run=4648-4648msec

> >   WRITE: bw=10.8MiB/s (11.3MB/s), 10.8MiB/s-10.8MiB/s (11.3MB/s-11.3MB/s), io=49.0MiB (52.4MB), run=4648-4648msec

> > 

> > Next: 

> > - run https://github.com/rhvgoyal/virtiofs-tests for  tread_pool_0, tread_pool_64, and 9pfs

> > - Test https://lore.kernel.org/linux-fsdevel/20201009181512.65496-1-vgoyal@redhat.com/

> > 

> > All the testing  for kata is based in:

> > https://github.com/jcvenegas/mrunner/blob/master/scripts/bash_workloads/build-qemu-and-run-fio-test.sh

> > 

> > I ran this using an azure VM.

> > 

> > Cheers,

> > Carlos

> > 

> > 

> > On 05/11/20 13:53, "Vivek Goyal" <vgoyal@redhat.com> wrote:

> > 

> >     On Thu, Nov 05, 2020 at 02:44:16PM -0500, Vivek Goyal wrote:

> >     > Right now we create a thread pool and main thread hands over the request

> >     > to thread in thread pool to process. Number of threads in thread pool

> >     > can be managed by option --thread-pool-size.

> >     > 

> >     > There is a chance that in case of some workloads, we might get better

> >     > performance if we don't handover the request to a different thread

> >     > and process in the context of thread receiving the request.

> >     > 

> >     > To implement that, redefine the meaning of --thread-pool-size=0 to

> >     > mean that don't use a thread pool. Instead process the request in

> >     > the context of thread receiving request from the queue.

> >     > 

> >     > I can't think how --thread-pool-size=0 is useful and hence using

> >     > that. If it is already useful somehow, I could look at defining

> >     > a new option say "--no-thread-pool".

> >     > 

> >     > I think this patch will be used more as a debug help to do comparison

> >     > when it is more effecient to do not hand over the requests to a

> >     > thread pool.

> > 

> >     I ran virtiofs-tests to comapre --thread-pool-size=0 and

> >     --thread-pool-size=64. And results seem to be all over the place. In

> >     some cases thread pool seems to perform batter and in other cases

> >     no-thread-pool seems to perform better.

> > 

> >     But in general it looks like that except for the case of libaio workload,

> >     no-thread-pool is helping.

> > 

> >     Thanks

> >     Vivek

> > 

> >     NAME                    WORKLOAD                Bandwidth       IOPS            

> >     thread-pool             seqread-psync           682.4mb         170.6k          

> >     no-thread-pool          seqread-psync           679.3mb         169.8k          

> > 

> >     thread-pool             seqread-psync-multi     2415.9mb        603.9k          

> >     no-thread-pool          seqread-psync-multi     2528.5mb        632.1k          

> > 

> >     thread-pool             seqread-mmap            591.7mb         147.9k          

> >     no-thread-pool          seqread-mmap            595.6mb         148.9k          

> > 

> >     thread-pool             seqread-mmap-multi      2195.3mb        548.8k          

> >     no-thread-pool          seqread-mmap-multi      2286.1mb        571.5k          

> > 

> >     thread-pool             seqread-libaio          329.1mb         82.2k           

> >     no-thread-pool          seqread-libaio          271.5mb         67.8k           

> > 

> >     thread-pool             seqread-libaio-multi    1387.1mb        346.7k          

> >     no-thread-pool          seqread-libaio-multi    1508.2mb        377.0k          

> > 

> >     thread-pool             randread-psync          59.0mb          14.7k           

> >     no-thread-pool          randread-psync          78.5mb          19.6k           

> > 

> >     thread-pool             randread-psync-multi    226.4mb         56.6k           

> >     no-thread-pool          randread-psync-multi    289.2mb         72.3k           

> > 

> >     thread-pool             randread-mmap           48.4mb          12.1k           

> >     no-thread-pool          randread-mmap           57.4mb          14.3k           

> > 

> >     thread-pool             randread-mmap-multi     183.5mb         45.8k           

> >     no-thread-pool          randread-mmap-multi     240.5mb         60.1k           

> > 

> >     thread-pool             randread-libaio         329.4mb         82.3k           

> >     no-thread-pool          randread-libaio         261.8mb         65.4k           

> > 

> >     thread-pool             randread-libaio-multi   252.1mb         63.0k           

> >     no-thread-pool          randread-libaio-multi   278.3mb         69.5k           

> > 

> >     thread-pool             seqwrite-psync          54.9mb          13.7k           

> >     no-thread-pool          seqwrite-psync          77.8mb          19.4k           

> > 

> >     thread-pool             seqwrite-psync-multi    219.9mb         54.9k           

> >     no-thread-pool          seqwrite-psync-multi    314.8mb         78.7k           

> > 

> >     thread-pool             seqwrite-mmap           191.7mb         47.9k           

> >     no-thread-pool          seqwrite-mmap           201.4mb         50.3k           

> > 

> >     thread-pool             seqwrite-mmap-multi     346.6mb         86.6k           

> >     no-thread-pool          seqwrite-mmap-multi     387.6mb         96.9k           

> > 

> >     thread-pool             seqwrite-libaio         306.4mb         76.6k           

> >     no-thread-pool          seqwrite-libaio         248.2mb         62.0k           

> > 

> >     thread-pool             seqwrite-libaio-multi   328.5mb         82.1k           

> >     no-thread-pool          seqwrite-libaio-multi   305.6mb         76.4k           

> > 

> >     thread-pool             randwrite-psync         55.6mb          13.9k           

> >     no-thread-pool          randwrite-psync         81.2mb          20.3k           

> > 

> >     thread-pool             randwrite-psync-multi   227.0mb         56.7k           

> >     no-thread-pool          randwrite-psync-multi   311.6mb         77.8k           

> > 

> >     thread-pool             randwrite-mmap          35.3mb          9038            

> >     no-thread-pool          randwrite-mmap          58.1mb          14.5k           

> > 

> >     thread-pool             randwrite-mmap-multi    140.8mb         35.2k           

> >     no-thread-pool          randwrite-mmap-multi    210.5mb         52.6k           

> > 

> >     thread-pool             randwrite-libaio        307.1mb         76.7k           

> >     no-thread-pool          randwrite-libaio        279.4mb         69.8k           

> > 

> >     thread-pool             randwrite-libaio-multi  361.3mb         90.3k           

> >     no-thread-pool          randwrite-libaio-multi  302.6mb         75.6k           

> > 

> >     thread-pool             randrw-psync            34.1mb/11.4mb   8754/2929       

> >     no-thread-pool          randrw-psync            38.7mb/12.9mb   9919/3321       

> > 

> >     thread-pool             randrw-psync-multi      126.2mb/42.3mb  31.5k/10.5k     

> >     no-thread-pool          randrw-psync-multi      172.2mb/57.7mb  43.0k/14.4k     

> > 

> >     thread-pool             randrw-mmap             24.1mb/8270kb   6194/2067       

> >     no-thread-pool          randrw-mmap             42.0mb/14.0mb   10.5k/3606      

> > 

> >     thread-pool             randrw-mmap-multi       127.9mb/42.8mb  31.9k/10.7k     

> >     no-thread-pool          randrw-mmap-multi       179.4mb/60.0mb  44.8k/15.0k     

> > 

> >     thread-pool             randrw-libaio           64.0mb/21.4mb   16.0k/5485      

> >     no-thread-pool          randrw-libaio           79.6mb/26.6mb   19.9k/6831      

> > 

> >     thread-pool             randrw-libaio-multi     174.2mb/58.3mb  43.5k/14.5k     

> >     no-thread-pool          randrw-libaio-multi     201.6mb/67.5mb  50.4k/16.8k     

> > 

> > 

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Dr. David Alan Gilbert Nov. 9, 2020, 10:11 a.m. UTC | #5

* Vivek Goyal (vgoyal@redhat.com) wrote:
> Right now we create a thread pool and main thread hands over the request

> to thread in thread pool to process. Number of threads in thread pool

> can be managed by option --thread-pool-size.

> 

> There is a chance that in case of some workloads, we might get better

> performance if we don't handover the request to a different thread

> and process in the context of thread receiving the request.

> 

> To implement that, redefine the meaning of --thread-pool-size=0 to

> mean that don't use a thread pool. Instead process the request in

> the context of thread receiving request from the queue.

> 

> I can't think how --thread-pool-size=0 is useful and hence using

> that. If it is already useful somehow, I could look at defining

> a new option say "--no-thread-pool".

> 

> I think this patch will be used more as a debug help to do comparison

> when it is more effecient to do not hand over the requests to a

> thread pool.

> 

> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>


I think this is OK, but you need to fix the style to match qemu rather
than kernel style.
(See qemu's scripts/checpatch.pl)

Dave

> ---

>  tools/virtiofsd/fuse_virtio.c | 33 ++++++++++++++++++++++++---------

>  1 file changed, 24 insertions(+), 9 deletions(-)

> 

> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c

> index ff86f6d1ce..60aa7cd3e5 100644

> --- a/tools/virtiofsd/fuse_virtio.c

> +++ b/tools/virtiofsd/fuse_virtio.c

> @@ -695,13 +695,17 @@ static void *fv_queue_thread(void *opaque)

>      struct VuDev *dev = &qi->virtio_dev->dev;

>      struct VuVirtq *q = vu_get_queue(dev, qi->qidx);

>      struct fuse_session *se = qi->virtio_dev->se;

> -    GThreadPool *pool;

> -

> -    pool = g_thread_pool_new(fv_queue_worker, qi, se->thread_pool_size, FALSE,

> -                             NULL);

> -    if (!pool) {

> -        fuse_log(FUSE_LOG_ERR, "%s: g_thread_pool_new failed\n", __func__);

> -        return NULL;

> +    GThreadPool *pool = NULL;

> +    GList *req_list = NULL;

> +

> +    if (se->thread_pool_size) {

> +        fuse_log(FUSE_LOG_DEBUG, "%s: Creating thread pool for Queue %d\n", __func__, qi->qidx);

> +        pool = g_thread_pool_new(fv_queue_worker, qi, se->thread_pool_size,

> +                                 FALSE, NULL);

> +        if (!pool) {

> +            fuse_log(FUSE_LOG_ERR, "%s: g_thread_pool_new failed\n", __func__);

> +            return NULL;

> +        }

>      }

>  

>      fuse_log(FUSE_LOG_INFO, "%s: Start for queue %d kick_fd %d\n", __func__,

> @@ -780,14 +784,25 @@ static void *fv_queue_thread(void *opaque)

>              req->bad_in_num = bad_in_num;

>              req->bad_out_num = bad_out_num;

>  

> -            g_thread_pool_push(pool, req, NULL);

> +            if (!se->thread_pool_size)

> +                req_list = g_list_prepend(req_list, req);

> +            else

> +                g_thread_pool_push(pool, req, NULL);

>          }

>  

>          pthread_mutex_unlock(&qi->vq_lock);

>          pthread_rwlock_unlock(&qi->virtio_dev->vu_dispatch_rwlock);

> +

> +        // Process all the requests.

> +        if (!se->thread_pool_size && req_list != NULL) {

> +	    g_list_foreach(req_list, fv_queue_worker, qi);

> +	    g_list_free(req_list);

> +            req_list = NULL;

> +	}

>      }

>  

> -    g_thread_pool_free(pool, FALSE, TRUE);

> +    if (pool)

> +        g_thread_pool_free(pool, FALSE, TRUE);

>  

>      return NULL;

>  }

> -- 

> 2.25.4

> 

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Vivek Goyal Nov. 9, 2020, 2:39 p.m. UTC | #6

On Mon, Nov 09, 2020 at 10:02:18AM +0000, Dr. David Alan Gilbert wrote:

[..]
> > Given in my tests, most of the workload benefited from

> > --thread-pool-size=0, may be this can be the default and user can

> > opt-in for a thread pool (--thread-pool-size=64) once it is known

> > that a particular workload benefits from it.

> > 

> > David, Stefan, WDYT?

> 

> I think having the option makes sense; the default I'm not sure about;


Ok, for now we can probably just introduce this option. And if further
test results prove it to be faster for most of the people, then we
can consider switching the default.

Vivek

Vivek Goyal Nov. 9, 2020, 2:40 p.m. UTC | #7

On Mon, Nov 09, 2020 at 10:11:02AM +0000, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:

> > Right now we create a thread pool and main thread hands over the request

> > to thread in thread pool to process. Number of threads in thread pool

> > can be managed by option --thread-pool-size.

> > 

> > There is a chance that in case of some workloads, we might get better

> > performance if we don't handover the request to a different thread

> > and process in the context of thread receiving the request.

> > 

> > To implement that, redefine the meaning of --thread-pool-size=0 to

> > mean that don't use a thread pool. Instead process the request in

> > the context of thread receiving request from the queue.

> > 

> > I can't think how --thread-pool-size=0 is useful and hence using

> > that. If it is already useful somehow, I could look at defining

> > a new option say "--no-thread-pool".

> > 

> > I think this patch will be used more as a debug help to do comparison

> > when it is more effecient to do not hand over the requests to a

> > thread pool.

> > 

> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

> 

> I think this is OK, but you need to fix the style to match qemu rather

> than kernel style.

> (See qemu's scripts/checpatch.pl)


Sure. I just posted V2 of the patch and took care of all the issues
reported by qemu's scripts/checkpatch.pl.

Vivek

Miklos Szeredi Nov. 12, 2020, 9:06 a.m. UTC | #8

On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>

> On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos wrote:

> > Hi Vivek,

> >

> > I have tested with Kata 1.12-apha0, the results seems that are better for the use fio config I am tracking.

> >

> > The fio config does  randrw:

> >

> > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75

> >

>

> Hi Carlos,

>

> Thanks for the testing.

>

> So basically two conclusions from your tests.

>

> - for virtiofs, --thread-pool-size=0 is performing better as comapred

>   to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately

>   35-40% better.

>

> - virtio-9p is still approximately 30% better than virtiofs

>   --thread-pool-size=0.

>

> As I had done the analysis that this particular workload (mixed read and

> write) is bad with virtiofs because after every write we are invalidating

> attrs and cache so next read ends up fetching attrs again. I had posted

> patches to gain some of the performance.

>

> https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.com/

>

> But I got the feedback to look into implementing file leases instead.


Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy, how about turning it
off for now?   9p doesn't have it, so no point in enabling it for
virtiofs by default.

Also I think some confusion comes from cache=auto being the default
for virtiofs.    Not sure what the default is for 9p, but comparing
default to default will definitely not be apples to apples since this
mode is nonexistent in 9p.

9p:cache=none  <-> virtiofs:cache=none
9p:cache=loose <-> virtiofs:cache=always

"9p:cache=mmap" and "virtiofs:cache=auto" have no match.

Untested patch attached.

Thanks,
Miklos

Miklos Szeredi Nov. 12, 2020, 9:57 a.m. UTC | #9

On Thu, Nov 12, 2020 at 10:06 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
>

> On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> >

> > On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos wrote:

> > > Hi Vivek,

> > >

> > > I have tested with Kata 1.12-apha0, the results seems that are better for the use fio config I am tracking.

> > >

> > > The fio config does  randrw:

> > >

> > > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75

> > >

> >

> > Hi Carlos,

> >

> > Thanks for the testing.

> >

> > So basically two conclusions from your tests.

> >

> > - for virtiofs, --thread-pool-size=0 is performing better as comapred

> >   to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately

> >   35-40% better.

> >

> > - virtio-9p is still approximately 30% better than virtiofs

> >   --thread-pool-size=0.

> >

> > As I had done the analysis that this particular workload (mixed read and

> > write) is bad with virtiofs because after every write we are invalidating

> > attrs and cache so next read ends up fetching attrs again. I had posted

> > patches to gain some of the performance.

> >

> > https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.com/

> >

> > But I got the feedback to look into implementing file leases instead.

>

> Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy, how about turning it

> off for now?   9p doesn't have it, so no point in enabling it for

> virtiofs by default.

>

> Also I think some confusion comes from cache=auto being the default

> for virtiofs.    Not sure what the default is for 9p, but comparing

> default to default will definitely not be apples to apples since this

> mode is nonexistent in 9p.

>

> 9p:cache=none  <-> virtiofs:cache=none

> 9p:cache=loose <-> virtiofs:cache=always

>

> "9p:cache=mmap" and "virtiofs:cache=auto" have no match.

>

> Untested patch attached.


Another performance issue with virtiofs could be due to the strict
page writeback rules in fuse that are meant to prevent misuse of
kernel memory by unprivileged processes.   Since virtiofs isn't
subject to that limitation, these could be relaxed.

Attaching a patch that does one half of this.  The other half is
getting rid of the page copying, that's a bit more involved, but
shouldn't be too difficult.  Just need to duplicate the cached
writeback callbacks for virtiofs and do away with the complex temp
page stuff.

Thanks,
Miklos
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d414c787e362..92c92c482c57 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -502,6 +502,7 @@ struct fuse_fs_context {
 	bool no_force_umount:1;
 	bool legacy_opts_show:1;
 	bool dax:1;
+	bool relax_writeback:1;
 	unsigned int max_read;
 	unsigned int blksize;
 	const char *subtype;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 36ab05315828..029325ebd1b3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1130,7 +1130,8 @@ void fuse_free_conn(struct fuse_conn *fc)
 }
 EXPORT_SYMBOL_GPL(fuse_free_conn);
 
-static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
+static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb,
+			 struct fuse_fs_context *ctx)
 {
 	int err;
 	char *suffix = "";
@@ -1151,21 +1152,24 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
 
 	/* fuse does it's own writeback accounting */
 	sb->s_bdi->capabilities &= ~BDI_CAP_WRITEBACK_ACCT;
-	sb->s_bdi->capabilities |= BDI_CAP_STRICTLIMIT;
 
-	/*
-	 * For a single fuse filesystem use max 1% of dirty +
-	 * writeback threshold.
-	 *
-	 * This gives about 1M of write buffer for memory maps on a
-	 * machine with 1G and 10% dirty_ratio, which should be more
-	 * than enough.
-	 *
-	 * Privileged users can raise it by writing to
-	 *
-	 *    /sys/class/bdi/<bdi>/max_ratio
-	 */
-	bdi_set_max_ratio(sb->s_bdi, 1);
+	if (!ctx->relax_writeback) {
+		sb->s_bdi->capabilities |= BDI_CAP_STRICTLIMIT;
+
+		/*
+		 * For a single fuse filesystem use max 1% of dirty +
+		 * writeback threshold.
+		 *
+		 * This gives about 1M of write buffer for memory maps on a
+		 * machine with 1G and 10% dirty_ratio, which should be more
+		 * than enough.
+		 *
+		 * Privileged users can raise it by writing to
+		 *
+		 *    /sys/class/bdi/<bdi>/max_ratio
+		 */
+		bdi_set_max_ratio(sb->s_bdi, 1);
+	}
 
 	return 0;
 }
@@ -1354,7 +1358,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 
 	fc->dev = sb->s_dev;
 	fm->sb = sb;
-	err = fuse_bdi_init(fc, sb);
+	err = fuse_bdi_init(fc, sb, ctx);
 	if (err)
 		goto err_dev_free;
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8868ac31a3c0..efbe1697612e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1288,6 +1288,7 @@ static inline void virtio_fs_ctx_set_defaults(struct fuse_fs_context *ctx)
 	ctx->destroy = true;
 	ctx->no_control = true;
 	ctx->no_force_umount = true;
+	ctx->relax_writeback = true;
 }
 
 static int virtio_fs_fill_super(struct super_block *sb, struct fs_context *fsc)

Christian Schoenebeck Nov. 12, 2020, 11 a.m. UTC | #10

On Donnerstag, 12. November 2020 10:06:37 CET Miklos Szeredi wrote:
> On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> > On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos 

wrote:
> > > Hi Vivek,

> > > 

> > > I have tested with Kata 1.12-apha0, the results seems that are better

> > > for the use fio config I am tracking.

> > > 

> > > The fio config does  randrw:

> > > 

> > > fio --direct=1 --gtod_reduce=1 --name=test

> > > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M

> > > --readwrite=randrw --rwmixread=75> 

> > Hi Carlos,

> > 

> > Thanks for the testing.

> > 

> > So basically two conclusions from your tests.

> > 

> > - for virtiofs, --thread-pool-size=0 is performing better as comapred

> > 

> >   to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately

> >   35-40% better.

> > 

> > - virtio-9p is still approximately 30% better than virtiofs

> > 

> >   --thread-pool-size=0.

> > 

> > As I had done the analysis that this particular workload (mixed read and

> > write) is bad with virtiofs because after every write we are invalidating

> > attrs and cache so next read ends up fetching attrs again. I had posted

> > patches to gain some of the performance.

> > 

> > https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.com/

> > 

> > But I got the feedback to look into implementing file leases instead.

> 

> Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy, how about turning it

> off for now?   9p doesn't have it, so no point in enabling it for

> virtiofs by default.

> 

> Also I think some confusion comes from cache=auto being the default

> for virtiofs.    Not sure what the default is for 9p, but comparing

> default to default will definitely not be apples to apples since this

> mode is nonexistent in 9p.


The default for 9p is cache=none.

That should be changed to cache=mmap being default IMO, because if users stick 
with the default setting 'none', it breaks software relying on mmap() calls.

> 

> 9p:cache=none  <-> virtiofs:cache=none

> 9p:cache=loose <-> virtiofs:cache=always

> 

> "9p:cache=mmap" and "virtiofs:cache=auto" have no match.


What does 'auto' do exactly?

> 

> Untested patch attached.

> 

> Thanks,

> Miklos


Best regards,
Christian Schoenebeck

Miklos Szeredi Nov. 12, 2020, 1:01 p.m. UTC | #11

On Thu, Nov 12, 2020 at 12:34 PM Christian Schoenebeck
<qemu_oss@crudebyte.com> wrote:
>

> On Donnerstag, 12. November 2020 10:06:37 CET Miklos Szeredi wrote:

> >

> > 9p:cache=none  <-> virtiofs:cache=none

> > 9p:cache=loose <-> virtiofs:cache=always

> >

> > "9p:cache=mmap" and "virtiofs:cache=auto" have no match.

>

> What does 'auto' do exactly?

It has a configurable timeout (default is 1s) for the attribute and
lookup cache and close-to-open consistency for data (flush on close,
invalidate on open).  This is similar to, but less refined as NFS.
It's something that could easily be improved if there's a need for it.

Thanks,
Miklos

Venegas Munoz, Jose Carlos Nov. 17, 2020, 4 p.m. UTC | #12

>   Not sure what the default is for 9p, but comparing

>   default to default will definitely not be apples to apples since this

>   mode is nonexistent in 9p.


In Kata we are looking for the best config for fs compatibility and performance. So even if are no apples to apples,
we are for the best config for both and comparing the best that each of them can do.

In the case of Kata for 9pfs(this is the config we have found has better performance and fs compatibility  in general) we have:
```
-device virtio-9p-pci # device type
,disable-modern=false 
,fsdev=extra-9p-kataShared # attr: device id  for fsdev
,mount_tag=kataShared  # attr: tag on how will be found de sharedfs 
,romfile= 
-fsdev local  #local: Simply lets QEMU call the individual VFS functions (more or less) directly on host.
,id=extra-9p-kataShared 
,path=${SHARED_PATH} # attrs: path to share
,security_model=none #    
#    passthrough: Files are stored using the same credentials as they are created on the guest. This requires QEMU to run as # root.
#    none: Same as "passthrough" except the sever won't report failures if it fails to set file attributes like ownership # 
#  (chown). This makes a passthrough like security model usable for people who run kvm as non root.
,multidevs=remap
```

The mount options are:
```
trans=virtio 
    ,version=9p2000.L 
    ,cache=mmap 
    ,"nodev" # Security: The nodev mount option specifies that the filesystem cannot contain special devices. 
    ,"msize=8192" # msize: Maximum packet size including any headers.
```

Aditionally we use this patch https://github.com/kata-containers/packaging/blob/stable-1.12/qemu/patches/5.0.x/0001-9p-removing-coroutines-of-9p-to-increase-the-I-O-per.patch

In kata for virtiofs I am testing  use:
```
-chardev socket
,id=ID-SOCKET
,path=.../vhost-fs.sock  # Path to vhost socket
-device vhost-user-fs-pci #
,chardev=ID-SOCKET 
,tag=kataShared 
,romfile=

 -object memory-backend-file # force use of memory sharable with virtiofsd. 
 ,id=dimm1
 ,size=2048M
 ,mem-path=/dev/shm 
 ,share=on
```
Virtiofsd:
```
-o cache=auto 
-o no_posix_lock # enable/disable remote posix lock
--thread-pool-size=0
```

And virtiofs mount options:
```
source:\"kataShared\" 
fstype:\"virtiofs\"
```

With this patch, comparing this two configurations, I have seen better performance with virtiofs in different hosts.

-
Carlos

-

On 12/11/20 3:06, "Miklos Szeredi" <mszeredi@redhat.com> wrote:

    On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal <vgoyal@redhat.com> wrote:
    >

    > On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos wrote:

    > > Hi Vivek,

    > >

    > > I have tested with Kata 1.12-apha0, the results seems that are better for the use fio config I am tracking.

    > >

    > > The fio config does  randrw:

    > >

    > > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75

    > >

    >

    > Hi Carlos,

    >

    > Thanks for the testing.

    >

    > So basically two conclusions from your tests.

    >

    > - for virtiofs, --thread-pool-size=0 is performing better as comapred

    >   to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately

    >   35-40% better.

    >

    > - virtio-9p is still approximately 30% better than virtiofs

    >   --thread-pool-size=0.

    >

    > As I had done the analysis that this particular workload (mixed read and

    > write) is bad with virtiofs because after every write we are invalidating

    > attrs and cache so next read ends up fetching attrs again. I had posted

    > patches to gain some of the performance.

    >

    > https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.com/

    >

    > But I got the feedback to look into implementing file leases instead.


    Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy, how about turning it
    off for now?   9p doesn't have it, so no point in enabling it for
    virtiofs by default.

    Also I think some confusion comes from cache=auto being the default
    for virtiofs.    Not sure what the default is for 9p, but comparing
    default to default will definitely not be apples to apples since this
    mode is nonexistent in 9p.

    9p:cache=none  <-> virtiofs:cache=none
    9p:cache=loose <-> virtiofs:cache=always

    "9p:cache=mmap" and "virtiofs:cache=auto" have no match.

    Untested patch attached.

    Thanks,
    Miklos

Christian Schoenebeck Nov. 17, 2020, 4:39 p.m. UTC | #13

On Dienstag, 17. November 2020 17:00:09 CET Venegas Munoz, Jose Carlos wrote:
> >   Not sure what the default is for 9p, but comparing

> >   default to default will definitely not be apples to apples since this

> >   mode is nonexistent in 9p.

> 

> 

> In Kata we are looking for the best config for fs compatibility and

> performance. So even if are no apples to apples, we are for the best

> config for both and comparing the best that each of them can do. 

> In the case of Kata for 9pfs(this is the config we have found has better

> performance and fs compatibility  in general) we have:

> ```

> -device virtio-9p-pci # device type

> ,disable-modern=false 

> ,fsdev=extra-9p-kataShared # attr: device id  for fsdev

> ,mount_tag=kataShared  # attr: tag on how will be found de sharedfs 

> ,romfile= 

> -fsdev local  #local: Simply lets QEMU call the individual VFS functions

> (more or less) directly on host.

> ,id=extra-9p-kataShared

> ,path=${SHARED_PATH} # attrs: path to share

> ,security_model=none #    

> #    passthrough: Files are stored using the same credentials as they are

> created on the guest. This requires QEMU to run as # root.

> #    none: Same

> as "passthrough" except the sever won't report failures if it fails to set

> file attributes like ownership # #  (chown). This makes a passthrough like

> security model usable for people who run kvm as non root. ,multidevs=remap

> ```

> 

> The mount options are:

> ```

> trans=virtio 

>     ,version=9p2000.L 

>     ,cache=mmap 

>     ,"nodev" # Security: The nodev mount option specifies that the

> filesystem cannot contain special devices. 

> ,"msize=8192" # msize: Maximum


Too small. You should definitely set 'msize' larger, as small a msize value 
causes 9p requests to be broken down into more and smaller 9p requests:
https://wiki.qemu.org/Documentation/9psetup#msize

Setting msize to a fairly high value might also substantially increase I/O 
performance i.e. on large I/O if guest application honors st_blksize returned 
by calling *stat() [ https://man7.org/linux/man-pages/man2/stat.2.html ]:

       st_blksize
              This field gives the "preferred" block size for efficient
              filesystem I/O.

For instance 'time cat /large/file' would be substantially faster, as it 
automatically adapts the chunk size.

> packet size including any headers. ```

> 

> Aditionally we use this patch

> https://github.com/kata-containers/packaging/blob/stable-1.12/qemu/patches/

> 5.0.x/0001-9p-removing-coroutines-of-9p-to-increase-the-I-O-per.patch


Interesting radical approach. :)

Yeah, the coroutine code in 9pfs is still a huge bottleneck. The problem of 
the existing coroutine code is that it dispatches too often between threads 
(i.e. main I/O thread of 9p server and background I/O threads doing the fs 
driver work) which causes huge latencies.

I started to fix that, yet I only completed the Treaddir request handler so 
far, which however gave a huge performance boost on such Treaddir requests:
https://lists.gnu.org/archive/html/qemu-devel/2020-01/msg05539.html

The plan is to do the same for the other 9p request handlers, and finally 
introducing an 'order=relaxed' option (right now the order is strict only).

That should end up being faster than your current no-coroutine hack later on, 
because you would have the benefits of both worlds: low latency and 
parallelism for the fs driver I/O work.

> In kata for virtiofs I am testing  use:

> ```

> -chardev socket

> ,id=ID-SOCKET

> ,path=.../vhost-fs.sock  # Path to vhost socket

> -device vhost-user-fs-pci #

> ,chardev=ID-SOCKET 

> ,tag=kataShared 

> ,romfile=

> 

>  -object memory-backend-file # force use of memory sharable with virtiofsd.

>  ,id=dimm1

>  ,size=2048M

>  ,mem-path=/dev/shm 

>  ,share=on

> ```

> Virtiofsd:

> ```

> -o cache=auto 

> -o no_posix_lock # enable/disable remote posix lock

> --thread-pool-size=0

> ```

> 

> And virtiofs mount options:

> ```

> source:\"kataShared\" 

> fstype:\"virtiofs\"

> ```

> 

> With this patch, comparing this two configurations, I have seen better

> performance with virtiofs in different hosts.

> -

> Carlos

> 

> -

> 

> On 12/11/20 3:06, "Miklos Szeredi" <mszeredi@redhat.com> wrote:

> 

>     On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> 

>     >

>     >

>     > On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos

>     > wrote:

>     > > 

>     > > Hi Vivek,

>     > >

>     > >

>     > >

>     > > I have tested with Kata 1.12-apha0, the results seems that are

>     > > better for the use fio config I am tracking.

>     > > > >

>     > >

>     > >

>     > > The fio config does  randrw:

>     > >

>     > >

>     > >

>     > > fio --direct=1 --gtod_reduce=1 --name=test

>     > > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M

>     > > --readwrite=randrw --rwmixread=75

>     > > > >

>     > >

>     >

>     >

>     >

>     > Hi Carlos,

>     >

>     >

>     >

>     > Thanks for the testing.

>     >

>     >

>     >

>     > So basically two conclusions from your tests.

>     >

>     >

>     >

>     > - for virtiofs, --thread-pool-size=0 is performing better as comapred

>     > 

>     >   to --thread-pool-size=1 as well as --thread-pool-size=64.

>     >   Approximately

>     >   35-40% better.

>     >

>     >

>     >

>     > - virtio-9p is still approximately 30% better than virtiofs

>     > 

>     >   --thread-pool-size=0.

>     >

>     >

>     >

>     > As I had done the analysis that this particular workload (mixed read

>     > and

>     > write) is bad with virtiofs because after every write we are

>     > invalidating

>     > attrs and cache so next read ends up fetching attrs again. I had

>     > posted

>     > patches to gain some of the performance.

>     >

>     >

>     >

>     > https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.c

>     > om/

>     >

>     >

>     >

>     > But I got the feedback to look into implementing file leases instead.

> 

> 

>     Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy, how about turning it

>     off for now?   9p doesn't have it, so no point in enabling it for

>     virtiofs by default.

> 

>     Also I think some confusion comes from cache=auto being the default

>     for virtiofs.    Not sure what the default is for 9p, but comparing

>     default to default will definitely not be apples to apples since this

>     mode is nonexistent in 9p.

> 

>     9p:cache=none  <-> virtiofs:cache=none

>     9p:cache=loose <-> virtiofs:cache=always

> 

>     "9p:cache=mmap" and "virtiofs:cache=auto" have no match.

> 

>     Untested patch attached.

> 

>     Thanks,

>     Miklos

> 


Best regards,
Christian Schoenebeck

Vivek Goyal Nov. 17, 2020, 6:55 p.m. UTC | #14

On Tue, Nov 17, 2020 at 04:00:09PM +0000, Venegas Munoz, Jose Carlos wrote:
> >   Not sure what the default is for 9p, but comparing

> >   default to default will definitely not be apples to apples since this

> >   mode is nonexistent in 9p.

> 

> In Kata we are looking for the best config for fs compatibility and performance. So even if are no apples to apples,

> we are for the best config for both and comparing the best that each of them can do.

> 

> In the case of Kata for 9pfs(this is the config we have found has better performance and fs compatibility  in general) we have:

> ```

> -device virtio-9p-pci # device type

> ,disable-modern=false 

> ,fsdev=extra-9p-kataShared # attr: device id  for fsdev

> ,mount_tag=kataShared  # attr: tag on how will be found de sharedfs 

> ,romfile= 

> -fsdev local  #local: Simply lets QEMU call the individual VFS functions (more or less) directly on host.

> ,id=extra-9p-kataShared 

> ,path=${SHARED_PATH} # attrs: path to share

> ,security_model=none #    

> #    passthrough: Files are stored using the same credentials as they are created on the guest. This requires QEMU to run as # root.

> #    none: Same as "passthrough" except the sever won't report failures if it fails to set file attributes like ownership # 

> #  (chown). This makes a passthrough like security model usable for people who run kvm as non root.

> ,multidevs=remap

> ```

> 

> The mount options are:

> ```

> trans=virtio 

>     ,version=9p2000.L 

>     ,cache=mmap 

>     ,"nodev" # Security: The nodev mount option specifies that the filesystem cannot contain special devices. 

>     ,"msize=8192" # msize: Maximum packet size including any headers.

> ```


How much RAM you are giving to these containers when using virtio-9p?

Vivek

Vivek Goyal Nov. 17, 2020, 10:24 p.m. UTC | #15

On Tue, Nov 17, 2020 at 04:00:09PM +0000, Venegas Munoz, Jose Carlos wrote:
> >   Not sure what the default is for 9p, but comparing

> >   default to default will definitely not be apples to apples since this

> >   mode is nonexistent in 9p.

> 

> In Kata we are looking for the best config for fs compatibility and performance. So even if are no apples to apples,

> we are for the best config for both and comparing the best that each of them can do.

Can we run tests in more than one configuration. Right now you are using
cache=mmap for virtio-9p and cache=auto for virtiofs and as Miklos
said this is not apples to apples comparison.

So you can continue to run above but also run additional tests if
time permits.

virtio-9p	virtio-fs

cache=mmap	cache=none + DAX
cache=none	cache=none
cache=loose	cache=always

Given you are using cache=mmap for virtio-9p, "cache=none + DAX" is
somewhat equivalent of that. Provides strong coherency as well as
allow for mmap() to work.

Now kernel virtiofs DAX support got merged in 5.10-rc1. For qemu,
you can use virtio-fs-dev branch.

https://gitlab.com/virtio-fs/qemu/-/commits/virtio-fs-dev

Thanks
Vivek

Vivek Goyal Nov. 18, 2020, 7:51 p.m. UTC | #16

On Thu, Nov 12, 2020 at 10:06:37AM +0100, Miklos Szeredi wrote:
> On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> >

> > On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos wrote:

> > > Hi Vivek,

> > >

> > > I have tested with Kata 1.12-apha0, the results seems that are better for the use fio config I am tracking.

> > >

> > > The fio config does  randrw:

> > >

> > > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75

> > >

> >

> > Hi Carlos,

> >

> > Thanks for the testing.

> >

> > So basically two conclusions from your tests.

> >

> > - for virtiofs, --thread-pool-size=0 is performing better as comapred

> >   to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately

> >   35-40% better.

> >

> > - virtio-9p is still approximately 30% better than virtiofs

> >   --thread-pool-size=0.

> >

> > As I had done the analysis that this particular workload (mixed read and

> > write) is bad with virtiofs because after every write we are invalidating

> > attrs and cache so next read ends up fetching attrs again. I had posted

> > patches to gain some of the performance.

> >

> > https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.com/

> >

> > But I got the feedback to look into implementing file leases instead.

> 

> Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy, how about turning it

> off for now?   9p doesn't have it, so no point in enabling it for

> virtiofs by default.


If we disable FUSE_AUTO_INVAL_DATA, then client page cache will not
be invalidated even after 1 sec, right? (for cache=auto).

Given now we also want to target sharing directory tree among multiple
clients, keeping FUSE_AUTO_INVAL_DATA enabled should help.

> 

> Also I think some confusion comes from cache=auto being the default

> for virtiofs.    Not sure what the default is for 9p, but comparing

> default to default will definitely not be apples to apples since this

> mode is nonexistent in 9p.

> 

> 9p:cache=none  <-> virtiofs:cache=none

> 9p:cache=loose <-> virtiofs:cache=always

> 

> "9p:cache=mmap" and "virtiofs:cache=auto" have no match.


Agreed from performance comparison point of view.

I will prefer cache=auto default (over cache=none) for virtiofsd. During
some kernel compilation tests over virtiofs, cache=none was painfully
slow as compared to cache=auto.

Thanks
Vivek

Vivek Goyal Nov. 18, 2020, 8:19 p.m. UTC | #17

On Thu, Nov 12, 2020 at 10:57:11AM +0100, Miklos Szeredi wrote:
[..]
> 

> Another performance issue with virtiofs could be due to the strict

> page writeback rules in fuse that are meant to prevent misuse of

> kernel memory by unprivileged processes.   Since virtiofs isn't

> subject to that limitation, these could be relaxed.

Hi Miklos,

I tried this patch with some of the write mmap workloads and it seems
to help. I ran virtiofsd with following options.

$ virtiofsd --socket-path=/tmp/vhostqemu -o source=/mnt//virtiofs-source -o no_posix_lock  -o cache=auto --thread-pool-size=0 --daemonize

Note, these workloads are not doing any fsync after write. So
they are effectively testing how fast one can do cached writes.

NAME                    WORKLOAD                Bandwidth       IOPS
limit-bdi               seqwrite-mmap           201.4mb         50.3k
nolimit-bdi             seqwrite-mmap           253.4mb         63.3k

limit-bdi               seqwrite-mmap-multi     386.7mb         96.6k
nolimit-bdi             seqwrite-mmap-multi     752.2mb         188.0k

limit-bdi               randwrite-mmap          53.5mb          13.3k
nolimit-bdi             randwrite-mmap          60.6mb          15.1k

limit-bdi               randwrite-mmap-multi    206.3mb         51.5k
nolimit-bdi             randwrite-mmap-multi    255.5mb         63.8k

seqwrite-mmap-multi seems to see the biggest jump.

So it might be a good idea to apply this patch.

> 

> Attaching a patch that does one half of this.  The other half is

> getting rid of the page copying, that's a bit more involved, but

> shouldn't be too difficult.  Just need to duplicate the cached

> writeback callbacks for virtiofs and do away with the complex temp

> page stuff.

Aha..., so we don't need all the complexity related to allocating
those temporary pages and then keeping track of writes in rb tree
and waiting for writes to finish etc. And it could be more like
a regular filesystem where lot of this stuff could be taken care
by common code for the case of virtiofs. That will be really nice.
Less code complexity to deal with. Also might provide performance
improvement.

Thanks
Vivek

Miklos Szeredi Nov. 23, 2020, 9:46 a.m. UTC | #18

On Wed, Nov 18, 2020 at 8:52 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>

> On Thu, Nov 12, 2020 at 10:06:37AM +0100, Miklos Szeredi wrote:

> > On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> > >

> > > On Fri, Nov 06, 2020 at 08:33:50PM +0000, Venegas Munoz, Jose Carlos wrote:

> > > > Hi Vivek,

> > > >

> > > > I have tested with Kata 1.12-apha0, the results seems that are better for the use fio config I am tracking.

> > > >

> > > > The fio config does  randrw:

> > > >

> > > > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75

> > > >

> > >

> > > Hi Carlos,

> > >

> > > Thanks for the testing.

> > >

> > > So basically two conclusions from your tests.

> > >

> > > - for virtiofs, --thread-pool-size=0 is performing better as comapred

> > >   to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately

> > >   35-40% better.

> > >

> > > - virtio-9p is still approximately 30% better than virtiofs

> > >   --thread-pool-size=0.

> > >

> > > As I had done the analysis that this particular workload (mixed read and

> > > write) is bad with virtiofs because after every write we are invalidating

> > > attrs and cache so next read ends up fetching attrs again. I had posted

> > > patches to gain some of the performance.

> > >

> > > https://lore.kernel.org/linux-fsdevel/20200929185015.GG220516@redhat.com/

> > >

> > > But I got the feedback to look into implementing file leases instead.

> >

> > Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy, how about turning it

> > off for now?   9p doesn't have it, so no point in enabling it for

> > virtiofs by default.

>

> If we disable FUSE_AUTO_INVAL_DATA, then client page cache will not

> be invalidated even after 1 sec, right? (for cache=auto).


Unless FOPEN_KEEP_CACHE is used (only cache=always, AFAICS) data cache
will be invalidated on open.

I think it's what NFS does, except NFS does invalidation based on
mtime *at the time of open*.   At least that's what I remember from
past reading of NFS code.

> Given now we also want to target sharing directory tree among multiple

> clients, keeping FUSE_AUTO_INVAL_DATA enabled should help.


Depends what applications expect.  THe close-to-open coherency
provided even without FUSE_AUTO_INVAL_DATA should be enough for the
case when strong coherency is not required.

>

> >

> > Also I think some confusion comes from cache=auto being the default

> > for virtiofs.    Not sure what the default is for 9p, but comparing

> > default to default will definitely not be apples to apples since this

> > mode is nonexistent in 9p.

> >

> > 9p:cache=none  <-> virtiofs:cache=none

> > 9p:cache=loose <-> virtiofs:cache=always

> >

> > "9p:cache=mmap" and "virtiofs:cache=auto" have no match.

>

> Agreed from performance comparison point of view.

>

> I will prefer cache=auto default (over cache=none) for virtiofsd. During

> some kernel compilation tests over virtiofs, cache=none was painfully

> slow as compared to cache=auto.


It would also be interesting to know the exact bottleneck in the
kernel compile with cache=none case.

Thanks,
Miklos

Venegas Munoz, Jose Carlos Nov. 23, 2020, 3:06 p.m. UTC | #19

For all the cases the memory for the guest is 2G.


On 17/11/20 12:55, "Vivek Goyal" <vgoyal@redhat.com> wrote:

    On Tue, Nov 17, 2020 at 04:00:09PM +0000, Venegas Munoz, Jose Carlos wrote:
    > >   Not sure what the default is for 9p, but comparing

    > >   default to default will definitely not be apples to apples since this

    > >   mode is nonexistent in 9p.

    > 

    > In Kata we are looking for the best config for fs compatibility and performance. So even if are no apples to apples,

    > we are for the best config for both and comparing the best that each of them can do.

    > 

    > In the case of Kata for 9pfs(this is the config we have found has better performance and fs compatibility  in general) we have:

    > ```

    > -device virtio-9p-pci # device type

    > ,disable-modern=false 

    > ,fsdev=extra-9p-kataShared # attr: device id  for fsdev

    > ,mount_tag=kataShared  # attr: tag on how will be found de sharedfs 

    > ,romfile= 

    > -fsdev local  #local: Simply lets QEMU call the individual VFS functions (more or less) directly on host.

    > ,id=extra-9p-kataShared 

    > ,path=${SHARED_PATH} # attrs: path to share

    > ,security_model=none #    

    > #    passthrough: Files are stored using the same credentials as they are created on the guest. This requires QEMU to run as # root.

    > #    none: Same as "passthrough" except the sever won't report failures if it fails to set file attributes like ownership # 

    > #  (chown). This makes a passthrough like security model usable for people who run kvm as non root.

    > ,multidevs=remap

    > ```

    > 

    > The mount options are:

    > ```

    > trans=virtio 

    >     ,version=9p2000.L 

    >     ,cache=mmap 

    >     ,"nodev" # Security: The nodev mount option specifies that the filesystem cannot contain special devices. 

    >     ,"msize=8192" # msize: Maximum packet size including any headers.

    > ```


    How much RAM you are giving to these containers when using virtio-9p?

    Vivek

Venegas Munoz, Jose Carlos Nov. 23, 2020, 3:07 p.m. UTC | #20

Cool, thanks I will bring some results this week.

On 17/11/20 16:24, "Vivek Goyal" <vgoyal@redhat.com> wrote:

    On Tue, Nov 17, 2020 at 04:00:09PM +0000, Venegas Munoz, Jose Carlos wrote:
    > >   Not sure what the default is for 9p, but comparing

    > >   default to default will definitely not be apples to apples since this

    > >   mode is nonexistent in 9p.

    > 

    > In Kata we are looking for the best config for fs compatibility and performance. So even if are no apples to apples,

    > we are for the best config for both and comparing the best that each of them can do.


    Can we run tests in more than one configuration. Right now you are using
    cache=mmap for virtio-9p and cache=auto for virtiofs and as Miklos
    said this is not apples to apples comparison.

    So you can continue to run above but also run additional tests if
    time permits.

    virtio-9p	virtio-fs

    cache=mmap	cache=none + DAX
    cache=none	cache=none
    cache=loose	cache=always

    Given you are using cache=mmap for virtio-9p, "cache=none + DAX" is
    somewhat equivalent of that. Provides strong coherency as well as
    allow for mmap() to work.

    Now kernel virtiofs DAX support got merged in 5.10-rc1. For qemu,
    you can use virtio-fs-dev branch.

    https://gitlab.com/virtio-fs/qemu/-/commits/virtio-fs-dev

    Thanks
    Vivek

virtiofsd: Use --thread-pool-size=0 to mean no thread pool

Commit Message

Comments

Patch