From patchwork Sun May 11 15:28:07 2014
X-Patchwork-Submitter: Julien Grall
X-Patchwork-Id: 29949
Message-ID: <536F9707.4060807@linaro.org>
Date: Sun, 11 May 2014 16:28:07 +0100
From: Julien Grall <julien.grall@linaro.org>
To: Wei Huang <w1.huang@samsung.com>, xen-devel@lists.xen.org
Cc: keir@xen.org, ian.campbell@citrix.com, stefano.stabellini@eu.citrix.com,
 andrew.cooper3@citrix.com, tim@xen.org, jaeyong.yoo@samsung.com,
 jbeulich@suse.com, ian.jackson@eu.citrix.com, yjhyun.yoo@samsung.com
References: <1399583908-21755-1-git-send-email-w1.huang@samsung.com>
 <1399583908-21755-6-git-send-email-w1.huang@samsung.com>
In-Reply-To: <1399583908-21755-6-git-send-email-w1.huang@samsung.com>
Subject: Re: [Xen-devel] [RFC v3 5/6] xen/arm: Add log_dirty support for ARM

Hi Wei,

Thank you for the patch. It seems you didn't address most of the comments I
made on V2. I will try to repeat all of them here. Next time, please go
through the replies to the previous version before sending a new one.

On 08/05/14 22:18, Wei Huang wrote:
> This patch implements log_dirty for ARM guest VMs. This feature
> is provided via two basic blocks: dirty_bit_map and VLPT
> (virtual-linear page table)
>
> 1. VLPT provides fast accessing of 3rd PTE of guest P2M.
> When creating a mapping for VLPT, the page table mapping
> becomes the following:
>     xen's 1st PTE --> xen's 2nd PTE --> guest p2m's 2nd PTE -->
>     guest p2m's 3rd PTE
>
> With VLPT, xen can immediately locate the 3rd PTE of guest P2M
> and modify PTE attirbute during dirty page tracking. The following

attribute

> link shows the performance comparison for handling a dirty-page
> between VLPT and typical page table walking.
> http://lists.xen.org/archives/html/xen-devel/2013-08/msg01503.html
>
> For more info about VLPT, please see
> http://www.technovelty.org/linux/virtual-linear-page-table.html.
>
> 2. Dirty bitmap
> The dirty bitmap is used to mark the pages which are dirty during
> migration. The info is used by Xen tools, via DOMCTL_SHADOW_OP_*,
> to figure out which guest pages need to be resent.
>
> Signed-off-by: Jaeyong Yoo
> Signed-off-by: Evgeny Fedotov
> Signed-off-by: Wei Huang
> ---
>  xen/arch/arm/domain.c |   6 +
>  xen/arch/arm/domctl.c |  31 +++-
>  xen/arch/arm/mm.c     | 298 ++++++++++++++++++++++++++++++++++++++-

As said on v2, the functions you have added are P2M-related, not Xen
memory-related. I think they should be moved to p2m.c.

[..]

> +/* Return start and end addr of guest RAM. Note this function only reports
> + * regular RAM. It does not cover other areas such as foreign mapped
> + * pages or MMIO space. */
> +void domain_get_ram_range(struct domain *d, paddr_t *start, paddr_t *end)
> +{
> +    if ( start )
> +        *start = GUEST_RAM_BASE;
> +
> +    if ( end )
> +        *end = GUEST_RAM_BASE + ((paddr_t) d->max_pages << PAGE_SHIFT);
> +}

As said on V1, this solution won't work. Ian plans to add multiple banks
support for the guest very soon. With this solution there is a 1GB hole
between the 2 banks, so your function will no longer work. Furthermore,
Xen should not assume that the layout of the guest will always start at
GUEST_RAM_BASE.

I think you can use max_mapped_pfn and lowest_mapped_pfn here. You may
need to modify their meaning a bit in the p2m code or introduce a new
field.

[..]

> +/* Allocate dirty bitmap resource */
> +static int bitmap_init(struct domain *d)
> +{
> +    paddr_t gma_start = 0;
> +    paddr_t gma_end = 0;
> +    int nr_bytes;
> +    int nr_pages;
> +    int i;
> +
> +    domain_get_ram_range(d, &gma_start, &gma_end);
> +
> +    nr_bytes = (PFN_DOWN(gma_end - gma_start) + 7) / 8;
> +    nr_pages = (nr_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
> +
> +    BUG_ON(nr_pages > MAX_DIRTY_BITMAP_PAGES);

AFAIU, this will crash Xen if the user tries to migrate a guest with more
than 8GB of RAM, right? If so, you should instead return an error.

[..]

> +    /* Check if reserved space is enough to cover guest physical address space.
> +     * Note that each LPAE page table entry is 64-bit (8 bytes). So we only
> +     * shift left with LPAE_SHIFT instead of PAGE_SHIFT. */
> +    domain_get_ram_range(d, &gma_start, &gma_end);
> +    required = (gma_end - gma_start) >> LPAE_SHIFT;
> +    if ( required > avail )

What is the maximum amount of RAM a guest can use if we want to migrate it?

> +    {
> +        dprintk(XENLOG_ERR, "Available VLPT is small for domU guest (avail: "
> +                "%#llx, required: %#llx)\n", (unsigned long long)avail,
> +                (unsigned long long)required);

You don't need the casts here.
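Purely for illustration, a sketch of the same message without the casts,
using the PRIpaddr format specifier (this assumes avail and required are
paddr_t, which the quoted context doesn't show):

/* Sketch only: print the paddr_t values directly with PRIpaddr instead of
 * casting them to unsigned long long. */
dprintk(XENLOG_ERR,
        "Available VLPT is small for domU guest (avail: %#"PRIpaddr", "
        "required: %#"PRIpaddr")\n", avail, required);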
> +        return -ENOMEM;
> +    }
> +
> +    /* Caulculate the base of 2nd linear table base for VIRT_LIN_P2M_START */

Calculate

> +    xen_second_linear_base = second_linear_offset(VIRT_LIN_P2M_START);
> +
> +    gp2m_start_index = gma_start >> FIRST_SHIFT;
> +    gp2m_end_index = (gma_end >> FIRST_SHIFT) + 1;
> +
> +    if ( xen_second_linear_base + gp2m_end_index >= LPAE_ENTRIES * 2 )
> +    {

In which case can this happen?

> +        dprintk(XENLOG_ERR, "xen second page is small for VLPT for domU");
> +        return -ENOMEM;
> +    }
> +
> +    /* Two pages are allocated to backup the related PTE content of guest
> +     * VM's 1st-level table. */
> +    second_lvl_page = alloc_domheap_pages(NULL, 1, 0);
> +    if ( second_lvl_page == NULL )
> +        return -ENOMEM;
> +    d->arch.dirty.second_lvl[0] = map_domain_page_global(
> +        page_to_mfn(second_lvl_page) );
> +    d->arch.dirty.second_lvl[1] = map_domain_page_global(
> +        page_to_mfn(second_lvl_page+1) );

map_domain_page_global can fail.

> +
> +    /* 1st level P2M of guest VM is 2 consecutive pages */
> +    first[0] = __map_domain_page(p2m->first_level);
> +    first[1] = __map_domain_page(p2m->first_level+1);
> +
> +    for ( i = gp2m_start_index; i < gp2m_end_index; ++i )
> +    {
> +        int k = i % LPAE_ENTRIES;
> +        int l = i / LPAE_ENTRIES;
> +        int k2 = (xen_second_linear_base + i) % LPAE_ENTRIES;
> +        int l2 = (xen_second_linear_base + i) / LPAE_ENTRIES;
> +
> +        /* Update 2nd-level PTE of Xen linear table. With this, Xen linear
> +         * page table layout becomes: 1st Xen linear ==> 2nd Xen linear ==>
> +         * 2nd guest P2M (i.e. 3rd Xen linear) ==> 3rd guest P2M (i.e. Xen
> +         * linear content) for VIRT_LIN_P2M_START address space. */
> +        write_pte(&xen_second[xen_second_linear_base+i], first[l][k]);

Space around binary operator.

[..]

> +/* Restore the xen page table for vlpt mapping for domain */
> +void log_dirty_restore(struct domain *d)
> +{
> +    int i;
> +
> +    /* Nothing to do as log dirty mode is off */
> +    if ( !(d->arch.dirty.mode) )

Your VLPT implementation uses xen_second, which is shared between every
pCPU. This restricts migration to a single guest at a time, so restoring
the VLPT seems pointless. On the other hand, I didn't see anything in your
patch series which prevents this case. You will likely corrupt one (if not
both) of the guests. You have to create a per-VCPU mapping for your VLPT
solution.

> +        return;
> +
> +    dsb(sy);

I think inner-shareable (ish) is enough here.

> +
> +    for ( i = d->arch.dirty.second_lvl_start; i < d->arch.dirty.second_lvl_end;
> +          ++i )
> +    {
> +        int k = i % LPAE_ENTRIES;
> +        int l = i / LPAE_ENTRIES;
> +
> +        if ( xen_second[i].bits != d->arch.dirty.second_lvl[l][k].bits )
> +        {
> +            write_pte(&xen_second[i], d->arch.dirty.second_lvl[l][k]);
> +            flush_xen_data_tlb_range_va(i << SECOND_SHIFT, 1 << SECOND_SHIFT);
> +        }
> +    }
> +
> +    dsb(sy);

Same here.

> +    isb();
> +}
> +
> +/* Turn on log dirty */
> +int log_dirty_on(struct domain *d)
> +{
> +    if ( vlpt_init(d) || bitmap_init(d) )

You have to call vlpt_cleanup if bitmap_init fails. Otherwise, you will
leave some pages mapped via map_domain_page_global (see the sketch a few
lines below).

> +        return -EINVAL;
> +
> +    return 0;
> +}
> +
> +/* Turn off log dirty */
> +void log_dirty_off(struct domain *d)
> +{
> +    bitmap_cleanup(d);
> +    vlpt_cleanup(d);
> +}
> +
> +/* Initialize log dirty fields */
> +int log_dirty_init(struct domain *d)

You don't check the return value in arch_domain_create, so the return type
should be void.
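Coming back to log_dirty_on: not an authoritative fix, just a minimal
sketch of the error path I mean, reusing vlpt_init, bitmap_init and
vlpt_cleanup as named in your patch:

/* Sketch only: unwind the VLPT mapping when the bitmap allocation fails,
 * so no globally mapped pages are leaked. */
int log_dirty_on(struct domain *d)
{
    if ( vlpt_init(d) )
        return -EINVAL;

    if ( bitmap_init(d) )
    {
        vlpt_cleanup(d);
        return -EINVAL;
    }

    return 0;
}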
> +{
> +    d->arch.dirty.count = 0;
> +    d->arch.dirty.mode = 0;
> +    spin_lock_init(&d->arch.dirty.lock);
> +
> +    d->arch.dirty.second_lvl_start = 0;
> +    d->arch.dirty.second_lvl_end = 0;
> +    d->arch.dirty.second_lvl[0] = NULL;
> +    d->arch.dirty.second_lvl[1] = NULL;
> +
> +    memset(d->arch.dirty.bitmap, 0, sizeof(d->arch.dirty.bitmap));
> +    d->arch.dirty.bitmap_pages = 0;
> +
> +    return 0;
> +}
> +
> +/* Log dirty tear down */
> +void log_dirty_teardown(struct domain *d)
> +{

I think nothing prevents destroying a domain with log dirty still on. You
should release all the resources you've allocated for this domain here.
Otherwise, Xen will leak memory.

> +    return;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index 603c097..0808cc9 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -6,6 +6,8 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>  #include
>  #include
>  #include
> @@ -208,6 +210,7 @@ static lpae_t mfn_to_p2m_entry(unsigned long mfn, unsigned int mattr,
>          break;
>
>      case p2m_ram_ro:
> +    case p2m_ram_logdirty:
>          e.p2m.xn = 0;
>          e.p2m.write = 0;
>          break;
> @@ -261,6 +264,10 @@ static int p2m_create_table(struct domain *d,
>
>      pte = mfn_to_p2m_entry(page_to_mfn(page), MATTR_MEM, p2m_invalid);
>
> +    /* mark the write bit (page table's case, ro bit) as 0
> +     * so, it is writable in case of vlpt access */

mark the entry as write-only?

> +    pte.pt.ro = 0;
> +
>      write_pte(entry, pte);
>
>      return 0;
> @@ -696,6 +703,203 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn)
>      return p >> PAGE_SHIFT;
>  }
>
> +/* Change types across all p2m entries in a domain */
> +void p2m_change_entry_type_global(struct domain *d, enum mg nt)
> +{

Can't you reuse apply_p2m_changes? I'm also concerned about preemption:
this function might take a very long time to run (depending on the size of
the memory).

> +    struct p2m_domain *p2m = &d->arch.p2m;
> +    paddr_t ram_base;
> +    int i1, i2, i3;
> +    int first_index, second_index, third_index;
> +    lpae_t *first = __map_domain_page(p2m->first_level);
> +    lpae_t pte, *second = NULL, *third = NULL;
> +
> +    domain_get_ram_range(d, &ram_base, NULL);
> +
> +    first_index = first_table_offset((uint64_t)ram_base);
> +    second_index = second_table_offset((uint64_t)ram_base);
> +    third_index = third_table_offset((uint64_t)ram_base);
> +
> +    BUG_ON(!first);

__map_domain_page always returns a valid pointer. This BUG_ON is pointless.

> +    spin_lock(&p2m->lock);
> +
> +    for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )
> +    {
> +        lpae_walk_t first_pte = first[i1].walk;
> +        if ( !first_pte.valid || !first_pte.table )
> +            goto out;
> +
> +        second = map_domain_page(first_pte.base);
> +        BUG_ON(!second);

Same remark as for BUG_ON(!first).

> +        for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 )
> +        {
> +            lpae_walk_t second_pte = second[i2].walk;
> +
> +            if ( !second_pte.valid || !second_pte.table )
> +                goto out;

With Ian's multiple bank support, the RAM region (as returned by
domain_get_ram_range) can contain a hole. Rather than leaving the loop, you
should continue.

> +
> +            third = map_domain_page(second_pte.base);
> +            BUG_ON(!third);

Same remark as for BUG_ON(!first).

> +
> +            for ( i3 = third_index; i3 < LPAE_ENTRIES; ++i3 )
> +            {
> +                lpae_walk_t third_pte = third[i3].walk;
> +
> +                if ( !third_pte.valid )
> +                    goto out;
> +
> +                pte = third[i3];
> +
> +                if ( nt == mg_ro )

I would use a switch for nt. It will be clearer and easier to extend (a
sketch follows further down, after the log_dirty_op comments).

> +                {
> +                    if ( pte.p2m.write == 1 )

We only want to trap read-write RAM regions.
Your solution may also trap MMIO, which is completely wrong. I would
replace it with if ( pte.p2m.type == p2m_ram_rw ).

> +                    {
> +                        pte.p2m.write = 0;
> +                        pte.p2m.type = p2m_ram_logdirty;
> +                    }
> +                    else
> +                    {
> +                        /* reuse avail bit as an indicator of 'actual'
> +                         * read-only */

Spurious comment?

> +                        pte.p2m.type = p2m_ram_rw;

Why do you unconditionally change the type?

> +                    }
> +                }
> +                else if ( nt == mg_rw )
> +                {
> +                    if ( pte.p2m.write == 0 &&
> +                         pte.p2m.type == p2m_ram_logdirty )

Can you add a comment to explain what this means?

> +                    {
> +                        pte.p2m.write = p2m_ram_rw;

Wrong field?

> +                    }
> +                }
> +                write_pte(&third[i3], pte);
> +            }
> +            unmap_domain_page(third);
> +
> +            third = NULL;
> +            third_index = 0;
> +        }
> +        unmap_domain_page(second);
> +
> +        second = NULL;
> +        second_index = 0;
> +        third_index = 0;
> +    }
> +
> +out:
> +    flush_tlb_all_local();

You want to flush the P2M on every CPU and only for the current VMID. I've
introduced a function flush_tlb_domain to do the job for you. I haven't
sent the patch yet (see [1] at the end of the mail).

> +    if ( third ) unmap_domain_page(third);
> +    if ( second ) unmap_domain_page(second);
> +    if ( first ) unmap_domain_page(first);
> +
> +    spin_unlock(&p2m->lock);
> +}
> +
> +/* Read a domain's log-dirty bitmap and stats. If the operation is a CLEAN,
> + * clear the bitmap and stats. */
> +int log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc)
> +{
> +    int peek = 1;
> +    int i;
> +    int bitmap_size;
> +    paddr_t gma_start, gma_end;
> +
> +    /* this hypercall is called from domain 0, and we don't know which guest's

What prevents this hypercall from being called by a domain other than
domain 0?

> +     * vlpt is mapped in xen_second, so, to be sure, we restore vlpt here */
> +    log_dirty_restore(d);
> +
> +    domain_get_ram_range(d, &gma_start, &gma_end);
> +    bitmap_size = (gma_end - gma_start) / 8;
> +
> +    if ( guest_handle_is_null(sc->dirty_bitmap) )
> +    {
> +        peek = 0;

Hrrrm... you set peek to 0 but never use it.

> +    }

Brackets are not necessary here.

> +    else
> +    {
> +        spin_lock(&d->arch.dirty.lock);
> +
> +        for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i )

Why not i++?

> +        {
> +            int j = 0;
> +            uint8_t *bitmap;
> +
> +            copy_to_guest_offset(sc->dirty_bitmap, i * PAGE_SIZE,
> +                                 d->arch.dirty.bitmap[i],
> +                                 bitmap_size < PAGE_SIZE ? bitmap_size :
> +                                                           PAGE_SIZE);

Where do you check that sc->dirty_bitmap has enough space to store the
guest dirty bitmap? You also forgot to update sc->pages.

> +            bitmap_size -= PAGE_SIZE;
> +
> +            /* set p2m page table read-only */
> +            bitmap = d->arch.dirty.bitmap[i];
> +            while ((j = find_next_bit((const long unsigned int *)bitmap,
> +                                      PAGE_SIZE*8, j)) < PAGE_SIZE*8)
> +            {
> +                lpae_t *vlpt;
> +                paddr_t addr = gma_start + (i << (2*PAGE_SHIFT+3)) +
> +                    (j << PAGE_SHIFT);
> +                vlpt = vlpt_get_3lvl_pte(addr);
> +                vlpt->p2m.write = 0;
> +                j++;
> +            }
> +        }
> +
> +        if ( sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN )
> +        {

I suspect this if is in the wrong place. I think, when sc->op is equal to
XEN_DOMCTL_SHADOW_OP_CLEAN, the buffer is NULL. Clean should also clear
d->arch.dirty.count ...

> +            for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i )
> +            {
> +                clear_page(d->arch.dirty.bitmap[i]);
> +            }

Brackets are not necessary here.

[..]
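To pull the p2m_change_entry_type_global comments together, here is a
rough, non-authoritative sketch of the innermost loop body with a switch on
nt and the type checks suggested above (field and type names follow your
patch; the final flush assumes patch [1] below is applied):

/* Sketch only: act on the p2m type rather than the write bit, so MMIO and
 * other non-RAM mappings are left alone. */
switch ( nt )
{
case mg_ro:
    if ( pte.p2m.type == p2m_ram_rw )
    {
        pte.p2m.write = 0;
        pte.p2m.type = p2m_ram_logdirty;
        write_pte(&third[i3], pte);
    }
    break;
case mg_rw:
    if ( pte.p2m.type == p2m_ram_logdirty )
    {
        pte.p2m.write = 1;              /* write is a bit, not a type */
        pte.p2m.type = p2m_ram_rw;
        write_pte(&third[i3], pte);
    }
    break;
default:
    /* mg_clear/mg_rx handling left out of this sketch. */
    break;
}

and, at the end of the walk, flush_tlb_domain(d) rather than
flush_tlb_all_local().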
> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> index df4d375..b652565 100644
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -1556,6 +1556,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>      struct hsr_dabt dabt = hsr.dabt;
>      int rc;
>      mmio_info_t info;
> +    int page_fault = ( dabt.write && ((dabt.dfsc & FSC_MASK) ==
> +                                      (FSC_FLT_PERM|FSC_3RD_LEVEL)) );
>
>      if ( !check_conditional_instr(regs, hsr) )
>      {
> @@ -1577,6 +1579,13 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>      if ( rc == -EFAULT )
>          goto bad_data_abort;
>
> +    /* domU page fault handling for guest live migration. Note that

I would remove domU in the comment.

> +     * dabt.valid can be 0 here */
> +    if ( page_fault && handle_page_fault(current->domain, info.gpa) )
> +    {
> +        /* Do not modify PC as guest needs to repeat memory operation */
> +        return;
> +    }
>      /* XXX: Decode the instruction if ISS is not valid */
>      if ( !dabt.valid )
>          goto bad_data_abort;
> diff --git a/xen/include/asm-arm/config.h b/xen/include/asm-arm/config.h
> index ef291ff..f18fae4 100644
> --- a/xen/include/asm-arm/config.h
> +++ b/xen/include/asm-arm/config.h
> @@ -87,6 +87,7 @@
>   *   0  -   8M
>   *
>   *  32M - 128M   Frametable: 24 bytes per page for 16GB of RAM
> + * 128M - 256M   Virtual-linear mapping to P2M table
>   * 256M -   1G   VMAP: ioremap and early_ioremap use this virtual address
>   *               space
>   *
> @@ -124,13 +125,15 @@
>  #define CONFIG_SEPARATE_XENHEAP 1
>
>  #define FRAMETABLE_VIRT_START  _AT(vaddr_t,0x02000000)
> -#define VMAP_VIRT_START  _AT(vaddr_t,0x10000000)
> +#define VIRT_LIN_P2M_START     _AT(vaddr_t,0x08000000)
> +#define VMAP_VIRT_START        _AT(vaddr_t,0x10000000)
> +#define VIRT_LIN_P2M_END       VMAP_VIRT_START

Shouldn't it be VMAP_VIRT_START - 1? I would also directly use
_AT(vaddr_t,0x0FFFFFFF) to stay consistent with the other *_END defines.

>  #define XENHEAP_VIRT_START     _AT(vaddr_t,0x40000000)
>  #define XENHEAP_VIRT_END       _AT(vaddr_t,0x7fffffff)
>  #define DOMHEAP_VIRT_START     _AT(vaddr_t,0x80000000)
>  #define DOMHEAP_VIRT_END       _AT(vaddr_t,0xffffffff)
>
> -#define VMAP_VIRT_END    XENHEAP_VIRT_START
> +#define VMAP_VIRT_END          XENHEAP_VIRT_START

Spurious changes?

>  #define DOMHEAP_ENTRIES        1024  /* 1024 2MB mapping slots */
>
> @@ -157,6 +160,11 @@
>
>  #define HYPERVISOR_VIRT_END    DIRECTMAP_VIRT_END
>
> +/* Definition for VIRT_LIN_P2M_START and VIRT_LIN_P2M_END (64-bit)
> + * TODO: Needs evaluation. */

Can you update the layout description for ARM64 above?

> +#define VIRT_LIN_P2M_START     _AT(vaddr_t, 0x08000000)
> +#define VIRT_LIN_P2M_END       VMAP_VIRT_START
> +
>  #endif
>
>  /* Fixmap slots */
> diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
> index aabeb51..ac82643 100644
> --- a/xen/include/asm-arm/domain.h
> +++ b/xen/include/asm-arm/domain.h
> @@ -162,6 +162,25 @@ struct arch_domain
>      }  vuart;
>
>      unsigned int evtchn_irq;
> +
> +    /* dirty page tracing */
> +    struct {
> +        spinlock_t lock;
> +        volatile int mode;              /* 1 if dirty pages tracing enabled */
> +        volatile unsigned int count;    /* dirty pages counter */
> +
> +        /* vlpt context switch */
> +        volatile int second_lvl_start;  /* start idx of virt linear space 2nd */
> +        volatile int second_lvl_end;    /* end idx of virt linear space 2nd */

Why must these 4 fields be volatile?
> +        lpae_t *second_lvl[2];          /* copy of guest P2M 1st-lvl content */
> +
> +        /* bitmap to track dirty pages */
> +#define MAX_DIRTY_BITMAP_PAGES 64
> +        /* Because each bit represents a dirty page, the total supported guest
> +         * memory is (64 entries x 4KB/entry x 8bits/byte x 4KB) = 8GB. */
> +        uint8_t *bitmap[MAX_DIRTY_BITMAP_PAGES];    /* dirty bitmap */
> +        int bitmap_pages;               /* # of dirty bitmap pages */
> +    } dirty;
>  } __cacheline_aligned;
>
>  struct arch_vcpu
> diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
> index b8d4e7d..ab19025 100644
> --- a/xen/include/asm-arm/mm.h
> +++ b/xen/include/asm-arm/mm.h
> @@ -4,6 +4,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  /* Align Xen to a 2 MiB boundary. */
> @@ -320,6 +321,7 @@ int donate_page(
>  #define domain_clamp_alloc_bitsize(d, b) (b)
>
>  unsigned long domain_get_maximum_gpfn(struct domain *d);
> +void domain_get_ram_range(struct domain *d, paddr_t *start, paddr_t *end);
>  extern struct domain *dom_xen, *dom_io, *dom_cow;
>
> @@ -341,6 +343,27 @@ static inline void put_page_and_type(struct page_info *page)
>      put_page(page);
>  }
>
> +enum mg { mg_clear, mg_ro, mg_rw, mg_rx };

Please describe this enum. Also, mg is too generic a name.

> +/************************************/
> +/*    Log-dirty support functions   */
> +/************************************/
> +int log_dirty_on(struct domain *d);
> +void log_dirty_off(struct domain *d);
> +int log_dirty_init(struct domain *d);
> +void log_dirty_teardown(struct domain *d);
> +void log_dirty_restore(struct domain *d);
> +int handle_page_fault(struct domain *d, paddr_t addr);
> +/* access leaf PTE of a given guest address (GPA) */
> +static inline lpae_t * vlpt_get_3lvl_pte(paddr_t addr)
> +{
> +    lpae_t *table = (lpae_t *)VIRT_LIN_P2M_START;
> +
> +    /* Since we slotted the guest's first p2m page table to xen's
> +     * second page table, one shift is enough for calculating the
> +     * index of guest p2m table entry */
> +    return &table[addr >> PAGE_SHIFT];
> +}

These functions should be part of p2m.h, not mm.h.

> @@ -41,6 +42,7 @@ typedef enum {
>      p2m_invalid = 0,    /* Nothing mapped here */
>      p2m_ram_rw,         /* Normal read/write guest RAM */
>      p2m_ram_ro,         /* Read-only; writes are silently dropped */
> +    p2m_ram_logdirty,   /* Read-only: special mode for log dirty */

You should add the new type at the end of the enum.

>      p2m_mmio_direct,    /* Read/write mapping of genuine MMIO area */
>      p2m_map_foreign,    /* Ram pages from foreign domain */
>      p2m_grant_map_rw,   /* Read/write grant mapping */
> @@ -49,7 +51,8 @@
>  } p2m_type_t;

Regards,

[1]

commit cebfd170dc13791df95fbb120c5894f0960e2804
Author: Julien Grall
Date:   Mon Apr 28 16:34:21 2014 +0100

    xen/arm: Introduce flush_tlb_domain

    The pattern p2m_load_VTTBR(d) -> flush_tlb -> p2m_load_VTTBR(current->domain)
    is used in a few places. Replace this usage with flush_tlb_domain, which
    takes care of this pattern.

    This will also help the readability of apply_p2m_changes, which is
    getting big.

    Signed-off-by: Julien Grall

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 603c097..61450cf 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -72,6 +72,21 @@ void p2m_restore_state(struct vcpu *n)
     isb();
 }
 
+void flush_tlb_domain(struct domain *d)
+{
+    /* Update the VTTBR if necessary with the domain d. In this case,
+     * it's only necessary to flush TLBs on every CPUs with the current VMID
+     * (our domain).
+     */
+    if ( d != current->domain )
+        p2m_load_VTTBR(d);
+
+    flush_tlb();
+
+    if ( d != current->domain )
+        p2m_load_VTTBR(current->domain);
+}
+
 static int p2m_first_level_index(paddr_t addr)
 {
     /*
@@ -450,19 +465,7 @@ static int apply_p2m_changes(struct domain *d,
     }
 
     if ( flush )
-    {
-        /* Update the VTTBR if necessary with the domain where mappings
-         * are created. In this case it's only necessary to flush TLBs
-         * on every CPUs with the current VMID (our domain).
-         */
-        if ( d != current->domain )
-            p2m_load_VTTBR(d);
-
-        flush_tlb();
-
-        if ( d != current->domain )
-            p2m_load_VTTBR(current->domain);
-    }
+        flush_tlb_domain(d);
 
     if ( op == ALLOCATE || op == INSERT )
     {
@@ -550,14 +553,10 @@ int p2m_alloc_table(struct domain *d)
     d->arch.vttbr = page_to_maddr(p2m->first_level) |
         ((uint64_t)p2m->vmid&0xff)<<48;
 
-    p2m_load_VTTBR(d);
-
     /* Make sure that all TLBs corresponding to the new VMID are flushed
      * before using it */
-    flush_tlb();
-
-    p2m_load_VTTBR(current->domain);
+    flush_tlb_domain(d);
 
     spin_unlock(&p2m->lock);
 
diff --git a/xen/include/asm-arm/flushtlb.h b/xen/include/asm-arm/flushtlb.h
index 329fbb4..5722c67 100644
--- a/xen/include/asm-arm/flushtlb.h
+++ b/xen/include/asm-arm/flushtlb.h
@@ -25,6 +25,9 @@ do {                                                    \
 /* Flush specified CPUs' TLBs */
 void flush_tlb_mask(const cpumask_t *mask);
 
+/* Flush CPU's TLBs for the specified domain */
+void flush_tlb_domain(struct domain *d);
+
 #endif /* __ASM_ARM_FLUSHTLB_H__ */
 /*
  * Local variables: