From patchwork Sat Nov 19 00:56:43 2016
X-Patchwork-Submitter: Cesar Philippidis
X-Patchwork-Id: 83045
From: Cesar Philippidis
Subject: [gomp4] remove use of CUDA unified memory in libgomp
To: "gcc-patches@gcc.gnu.org"
Message-ID: <239c8d27-7b8f-130e-8e06-d2007053164c@codesourcery.com>
Date: Fri, 18 Nov 2016 16:56:43 -0800

This patch eliminates the use of CUDA unified shared memory via cuMemcpy inside nvptx_exec. The major problem with unified memory is that the CUDA driver needs to copy all of the host addresses to a special data region prior to transferring the data to and from the accelerator.
I'm not sure why the use of CUDA unified shared memory is necessary here, because libgomp already has the machinery to manage the data mappings itself. The only use of CUDA unified memory occurs inside nvptx_exec. Specifically, that function uses 'dp', which is managed by the map_* functions, as a staging area to upload the omp struct arguments to the PTX kernel. This use of unified memory is particularly unfortunate because it imposes a synchronization barrier each time a PTX kernel is launched.

At first I tried to eliminate those map_* functions altogether, but that caused failures in asyncwait-1.c. This happens because the device memory used by async PTX kernel X may be overwritten by async PTX kernel Y. The map functions try to rectify this situation by implementing a pseudo circular queue in a page-sized block of memory. However, that queue implementation is fragile, because map_pop 'frees' all of the allocations in the queue, not just the memory used by the most recently launched kernel.

This patch does two things. First, it replaces the call to cuMemcpy inside nvptx_exec with cuMemcpyHtoD. Second, it teaches the map routines to use a linked-list data structure to manage asynchronous PTX stream arguments. As an optimization, map_init reserves a large block of memory for the first cuda_map, so that non-async functions do not have to waste time allocating device memory. Additional cuda_maps are created for each async PTX stream, with the requested SIZE, as necessary. Because of the asynchronous usage, the map routines need to be guarded by a pthread lock; map_init and map_fini are already guarded by ptx_dev_lock, and map_push and map_pop are now locked by ptx_event_lock. I don't really like the use of malloc for the cuda_maps, but that can be optimized with a different data structure later.

I've applied this patch to gomp-4_0-branch.

Cesar

2016-11-18  Cesar Philippidis

        libgomp/
        * plugin/plugin-nvptx.c (struct cuda_map): New.
        (struct ptx_stream): Replace d, h, h_begin, h_end, h_next,
        h_prev, h_tail with (cuda_map *) map.
        (cuda_map_create): New function.
        (cuda_map_destroy): New function.
        (map_init): Update to use a linked list of cuda_map objects.
        (map_fini): Likewise.
        (map_pop): Likewise.
        (map_push): Likewise.  Return CUdeviceptr instead of void.
        (init_streams_for_device): Remove stale references to ptx_stream
        members.
        (select_stream_for_async): Likewise.
        (nvptx_exec): Update call to map_push.  Use cuMemcpyHtoD instead
        of cuMemcpy.

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index e4fcc0e..c435012 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -95,20 +95,20 @@ cuda_error (CUresult r)
 static unsigned int instantiated_devices = 0;
 static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 
+struct cuda_map
+{
+  CUdeviceptr d;
+  size_t size;
+  bool active;
+  struct cuda_map *next;
+};
+
 struct ptx_stream
 {
   CUstream stream;
   pthread_t host_thread;
   bool multithreaded;
-
-  CUdeviceptr d;
-  void *h;
-  void *h_begin;
-  void *h_end;
-  void *h_next;
-  void *h_prev;
-  void *h_tail;
-
+  struct cuda_map *map;
   struct ptx_stream *next;
 };
 
@@ -120,101 +120,114 @@ struct nvptx_thread
   struct ptx_device *ptx_dev;
 };
 
+static struct cuda_map *
+cuda_map_create (size_t size)
+{
+  struct cuda_map *map = GOMP_PLUGIN_malloc (sizeof (struct cuda_map));
+
+  assert (map);
+
+  map->next = NULL;
+  map->size = size;
+  map->active = false;
+
+  CUDA_CALL_ERET (NULL, cuMemAlloc, &map->d, size);
+  assert (map->d);
+
+  return map;
+}
+
+static void
+cuda_map_destroy (struct cuda_map *map)
+{
+  CUDA_CALL_ASSERT (cuMemFree, map->d);
+  free (map);
+}
+
+/* The following map_* routines manage the CUDA device memory that
+   contains the data mapping arguments for cuLaunchKernel.  Each
+   asynchronous PTX stream may have multiple pending kernel
+   invocations, which are launched in a FIFO order.  As such, the map
+   routines maintain a queue of cuLaunchKernel arguments.
+
+   Calls to map_push and map_pop must be guarded by ptx_event_lock.
+   Likewise, calls to map_init and map_fini are guarded by
+   ptx_dev_lock inside GOMP_OFFLOAD_init_device and
+   GOMP_OFFLOAD_fini_device, respectively.  */
+
 static bool
 map_init (struct ptx_stream *s)
 {
   int size = getpagesize ();
 
   assert (s);
-  assert (!s->d);
-  assert (!s->h);
-
-  CUDA_CALL (cuMemAllocHost, &s->h, size);
-  CUDA_CALL (cuMemHostGetDevicePointer, &s->d, s->h, 0);
-  assert (s->h);
+  s->map = cuda_map_create (size);
 
-  s->h_begin = s->h;
-  s->h_end = s->h_begin + size;
-  s->h_next = s->h_prev = s->h_tail = s->h_begin;
-
-  assert (s->h_next);
-  assert (s->h_end);
   return true;
 }
 
 static bool
 map_fini (struct ptx_stream *s)
 {
-  CUDA_CALL (cuMemFreeHost, s->h);
+  assert (s->map->next == NULL);
+  assert (!s->map->active);
+
+  cuda_map_destroy (s->map);
+
   return true;
 }
 
 static void
 map_pop (struct ptx_stream *s)
 {
-  assert (s != NULL);
-  assert (s->h_next);
-  assert (s->h_prev);
-  assert (s->h_tail);
-
-  s->h_tail = s->h_next;
-
-  if (s->h_tail >= s->h_end)
-    s->h_tail = s->h_begin + (int) (s->h_tail - s->h_end);
+  struct cuda_map *next;
 
-  if (s->h_next == s->h_tail)
-    s->h_prev = s->h_next;
+  assert (s != NULL);
 
-  assert (s->h_next >= s->h_begin);
-  assert (s->h_tail >= s->h_begin);
-  assert (s->h_prev >= s->h_begin);
+  if (s->map->next == NULL)
+    {
+      s->map->active = false;
+      return;
+    }
 
-  assert (s->h_next <= s->h_end);
-  assert (s->h_tail <= s->h_end);
-  assert (s->h_prev <= s->h_end);
+  next = s->map->next;
+  cuda_map_destroy (s->map);
+  s->map = next;
 }
 
-static void
-map_push (struct ptx_stream *s, size_t size, void **h, void **d)
+static CUdeviceptr
+map_push (struct ptx_stream *s, size_t size)
 {
-  int left;
-  int offset;
+  struct cuda_map *map = NULL, *t = NULL;
 
-  assert (s != NULL);
+  assert (s);
+  assert (s->map);
 
-  left = s->h_end - s->h_next;
+  /* Each PTX stream requires a separate data region to store the
+     launch arguments for cuLaunchKernel.  Allocate a new cuda_map
+     and push it to the end of the list.  */
+  if (s->map->active)
+    {
+      map = cuda_map_create (size);
 
-  assert (s->h_prev);
-  assert (s->h_next);
+      for (t = s->map; t->next != NULL; t = t->next)
+        ;
 
-  if (size >= left)
+      t->next = map;
+    }
+  else if (s->map->size < size)
     {
-      assert (s->h_next == s->h_prev);
-      s->h_next = s->h_prev = s->h_tail = s->h_begin;
+      cuda_map_destroy (s->map);
+      map = cuda_map_create (size);
     }
+  else
+    map = s->map;
 
-  assert (s->h_next);
-
-  offset = s->h_next - s->h;
-
-  *d = (void *)(s->d + offset);
-  *h = (void *)(s->h + offset);
-
-  s->h_prev = s->h_next;
-  s->h_next += size;
-
-  assert (s->h_prev);
-  assert (s->h_next);
-
-  assert (s->h_next >= s->h_begin);
-  assert (s->h_tail >= s->h_begin);
-  assert (s->h_prev >= s->h_begin);
-  assert (s->h_next <= s->h_end);
-  assert (s->h_tail <= s->h_end);
-  assert (s->h_prev <= s->h_end);
+  s->map = map;
+  s->map->active = true;
 
-  return;
+  return s->map->d;
 }
 
 /* Target data function launch information.  */
@@ -335,8 +348,6 @@ init_streams_for_device (struct ptx_device *ptx_dev, int concurrency)
   null_stream->stream = NULL;
   null_stream->host_thread = pthread_self ();
   null_stream->multithreaded = true;
-  null_stream->d = (CUdeviceptr) NULL;
-  null_stream->h = NULL;
 
   if (!map_init (null_stream))
     return false;
@@ -470,8 +481,6 @@ select_stream_for_async (int async, pthread_t thread, bool create,
 
       s->host_thread = thread;
       s->multithreaded = false;
-      s->d = (CUdeviceptr) NULL;
-      s->h = NULL;
 
       if (!map_init (s))
         {
           pthread_mutex_unlock (&ptx_dev->stream_lock);
@@ -889,7 +898,8 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   int i;
   struct ptx_stream *dev_str;
   void *kargs[1];
-  void *hp, *dp;
+  void *hp;
+  CUdeviceptr dp;
   struct nvptx_thread *nvthd = nvptx_thread ();
   const char *maybe_abort_msg = "(perhaps abort was called)";
@@ -999,17 +1009,20 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is
      the corresponding device pointer.  */
-  map_push (dev_str, mapnum * sizeof (void *), &hp, &dp);
+  pthread_mutex_lock (&ptx_event_lock);
+  dp = map_push (dev_str, mapnum * sizeof (void *));
+  pthread_mutex_unlock (&ptx_event_lock);
 
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
   /* Copy the array of arguments to the mapped page.  */
+  hp = alloca (sizeof (void *) * mapnum);
   for (i = 0; i < mapnum; i++)
     ((void **) hp)[i] = devaddrs[i];
 
   /* Copy the (device) pointers to arguments to the device (dp and hp might in
      fact have the same value on a unified-memory system).  */
-  CUDA_CALL_ASSERT (cuMemcpy, (CUdeviceptr) dp, (CUdeviceptr) hp,
+  CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, hp,
                     mapnum * sizeof (void *));
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
                      " gangs=%u, workers=%u, vectors=%u\n",