RFC: shared-memory synchronization benchmarking in glibc

Message ID 1484061478.5606.205.camel@redhat.com
State New

Commit Message

Torvald Riegel Jan. 10, 2017, 3:17 p.m. UTC
I'd like to get feedback on how we build glibc's synchronization-related
microbenchmarks, in particular regarding two aspects: (1) How to best
emulate real-world workloads and (2) how to best fit them into the
existing microbenchmark suite.

Attached is what I've been playing with, which uses this for rwlock
benchmarking.


=== Workloads

As you can see in the attached patch, I'm using random accesses to both
a thread-shared and a thread-private memory region to simulate work done
between calls to synchronization primitives.  For example, in the rwlock
case, that means accesses to thread-private data between reader critical
sections and accesses to the shared data within critical sections.  So,
"work" is represented as invoking a Marsaglia-XOR-RNG followed by a
memory access, repeated a certain number of times.

This aims at giving the memory system more to do than if we were just,
for example, spinning until a certain number of CPU cycles has passed.
Also, it allows one to introduce (cache) misses by varying the
parameters.

Is this a reasonable way of emulating real-world workloads for
concurrent code, from the perspective of your hardware?
Any suggestions for how to improve this?
Any suggestions regarding scenarios that should be covered?
I'm looking both for scenarios that can significantly affect
synchronization performance (eg, stressing the memory system more so
that cache misses are harder to hide performance-wise) and for
scenarios that are representative of common real-world workloads (eg,
should there be more dependencies, such as would arise when navigating
through linked data structures?).
What do you use in your own testing?
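
Regarding the linked-data-structure question: one cheap way to get such
dependencies would be a pointer-chasing variant of the work function.  A
rough sketch (hypothetical, not part of the attached patch; next_idx
would be precomputed as a random permutation cycle over the region):

  /* Dependency-chained "work": the address of each load depends on the
     value returned by the previous load, so the core cannot overlap the
     resulting cache misses.  */
  static size_t
  chase_mem (const unsigned int *next_idx, size_t idx, size_t steps)
  {
    while (steps-- > 0)
      idx = next_idx[idx];
    return idx;  /* Returned so the compiler cannot drop the loop.  */
  }

This contrasts with the RNG-driven accesses in the patch, where the next
address does not depend on loaded data, so independent misses can overlap.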

I'd also be interested in seeing microbenchmarks that validate the
assumptions we make.  For example, it would be nice to show cases in
which the acquire hints on Power result in a performance gain, and where
not (so the rest of the glibc developers get a better idea where to use
the hint and where not to use it).

A first target for improved benchmarking would be lock elision
performance, I believe.


=== Making it fit into the microbenchmark framework

I'd like to hear about ideas and suggestions for how to best integrate
this into the microbenchmark suite.  The difference from many of the
current tests is that it's not sufficient to just run a function in a
tight loop.  We need to look at many more workloads (eg, long vs. short
critical sections, different numbers of threads, ...).

That means we need to collect a lot more data, and present it in a
better way.  In the past (for another project), I've used a simple
mongodb database to store json objects representing benchmark results,
and then ran queries across it that fed into visualization (eg, using
gnuplot).  Is this something that somebody else is already working on?

Doing that would mean collecting more information about when/where/... a
benchmark run happened.  Any thoughts on that?
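
For example, each run could be stored as one json object along the lines
of the following (a hypothetical record; all field names are invented):

  {
    "benchmark": "bench-multithread",
    "variant": "rwlock-rdlock",
    "threads": 4,
    "duration_s": 5,
    "iters_per_s": 123456,
    "kernel": "4.9.0",
    "compiler": "gcc 6.3",
    "cpu": "POWER8",
    "glibc_commit": "7554bd4a",
    "timestamp": "2017-01-10T15:17:00Z"
  }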

It would also be useful if benchmarks were easier to run manually so
that developers (and users) can experiment more easily with different
workloads.  However, that would mean that we would need to be able to
set benchmark parameters when invoking a benchmark, and not just be able
to produce a benchmark binary that has them baked in.  Any thoughts or
preferences on how best to do that?
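
One simple option would be to keep the current constants as defaults and
let the environment override them, so no framework changes are needed; a
minimal sketch (the BENCH_* names are made up):

  #include <stdlib.h>

  /* Read a numeric parameter from the environment, falling back to the
     built-in default, so that eg
       BENCH_THREADS=8 BENCH_DURATION=10 ./bench-multithread
     works without rebuilding.  duration and max_threads would have to
     become plain variables instead of static const.  */
  static long
  bench_param (const char *name, long def)
  {
    const char *s = getenv (name);
    return (s != NULL && *s != '\0') ? atol (s) : def;
  }

  /* In main:  */
  duration = bench_param ("BENCH_DURATION", 5);
  max_threads = bench_param ("BENCH_THREADS", 4);

Command-line flags would work equally well; the environment just avoids
touching how the suite currently invokes the benchmark binaries.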

Comments

Adhemerval Zanella Netto Jan. 27, 2017, 7:49 p.m. UTC | #1
Reviving this thread because I think we can really benefit from this kind
of workload.

On 10/01/2017 13:17, Torvald Riegel wrote:
> I'd like to get feedback on how we build glibc's synchronization-related
> microbenchmarks, in particular regarding two aspects: (1) How to best
> emulate real-world workloads and (2) how to best fit them into the
> existing microbenchmark suite.

I think we will have multiple different workloads, and some architectures
will even focus on particular aspects of performance.  So imho focusing
on creating a workload generator, in the same way fio [1] does for IO,
could let us create different kinds of synthetic synchronization workloads.

[1] https://github.com/axboe/fio

> Attached is what I've been playing with, which uses this for rwlock
> benchmarking.
>
> === Workloads
>
> As you can see in the attached patch, I'm using random accesses to both
> a thread-shared and a thread-private memory region to simulate work done
> between calls to synchronization primitives.  For example, in the rwlock
> case, that means accesses to thread-private data between reader critical
> sections and accesses to the shared data within critical sections.  So,
> "work" is represented as invoking a Marsaglia-XOR-RNG followed by a
> memory access, repeated a certain number of times.
>
> This aims at giving the memory system more to do than if we were just,
> for example, spinning until a certain number of CPU cycles has passed.
> Also, it allows one to introduce (cache) misses by varying the
> parameters.

I think as an initial synthetic workload it should give us some meaningful
data.  Ideally I would like a way to select among more kinds of workloads,
with or without different memory access patterns.

> Is this a reasonable way of emulating real-world workloads for
> concurrent code, from the perspective of your hardware?
> Any suggestions for how to improve this?
> Any suggestions regarding scenarios that should be covered?
> I'm looking both for scenarios that can significantly affect
> synchronization performance (eg, stressing the memory system more so
> that cache misses are harder to hide performance-wise) and for
> scenarios that are representative of common real-world workloads (eg,
> should there be more dependencies, such as would arise when navigating
> through linked data structures?).
> What do you use in your own testing?

I think we first should define exactly what kind of metric we are aiming
for here.  Are we measuring synchronization latency for a specific
implementation (semaphores, pthread r{d,w}lock) or just atomic operations?
Are we aiming for high or low contention?  How many threads relative to
the cpuset?  Critical sections that are CPU-intensive or memory-intensive?
What about memory topology and its different latencies?

That's why I think that by aiming to create a configurable workload
generator we can then get different metrics, such as a workload with high
thread creation that updates atomic variables, or a pool of threads with
different thread set sizes.

> I'd also be interested in seeing microbenchmarks that validate the
> assumptions we make.  For example, it would be nice to show cases in
> which the acquire hints on Power result in a performance gain, and where
> not (so the rest of the glibc developers get a better idea where to use
> the hint and where not to use it).
>
> A first target for improved benchmarking would be lock elision
> performance, I believe.
>
> === Making it fit into the microbenchmark framework
>
> I'd like to hear about ideas and suggestions for how to best integrate
> this into the microbenchmark suite.  The difference from many of the
> current tests is that it's not sufficient to just run a function in a
> tight loop.  We need to look at many more workloads (eg, long vs. short
> critical sections, different numbers of threads, ...).
>
> That means we need to collect a lot more data, and present it in a
> better way.  In the past (for another project), I've used a simple
> mongodb database to store json objects representing benchmark results,
> and then ran queries across it that fed into visualization (eg, using
> gnuplot).  Is this something that somebody else is already working on?

I do not have much experience with this kind of program, but wouldn't a
simpler solution such as tinydb [1] suffice to store and query the required
data?  I think that requiring a full db deployment to run these benchmarks
is too complex.

[1] https://pypi.python.org/pypi/tinydb

> Doing that would mean collecting more information about when/where/... a
> benchmark run happened.  Any thoughts on that?

I think collecting the expected default information (kernel, compiler,
cpu) plus the topology from lstopo should suffice.

> It would also be useful if benchmarks were easier to run manually so
> that developers (and users) can experiment more easily with different
> workloads.  However, that would mean that we would need to be able to
> set benchmark parameters when invoking a benchmark, and not just be able
> to produce a benchmark binary that has them baked in.  Any thoughts or
> preferences on how best to do that?

I think configurable workloads in the form of either an ini or a json file
should be good enough.
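
For example (an invented format, parameter names mirroring the constants
in the attached patch):

  [rwlock-read-mostly]
  primitive = pthread_rwlock
  threads = 1,2,4
  duration_s = 5
  mem_shared_kb = 1
  mem_private_mb = 16
  reads_per_write = 20
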
Stefan Liebler Feb. 15, 2017, 4:26 p.m. UTC | #2
On 01/27/2017 08:49 PM, Adhemerval Zanella wrote:
> Reviving this thread because I think we can really benefit from this kind
> of workload.
>
> On 10/01/2017 13:17, Torvald Riegel wrote:
>> I'd like to get feedback on how we build glibc's synchronization-related
>> microbenchmarks, in particular regarding two aspects: (1) How to best
>> emulate real-world workloads and (2) how to best fit them into the
>> existing microbenchmark suite.

Ondřej Bílka wrote the dryrun framework (see the post "Moving dryrun as a
separate project.",
https://sourceware.org/ml/libc-alpha/2015-08/msg00400.html), which collects
data on calls to string functions in real applications.
Perhaps an extended version could collect timing information for calls to
synchronization functions.
That information could be used to extract parameters for existing
benchmarks, or to replay the calls.
This way, somebody could contribute real workload information.
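
To illustrate the mechanism (this is not dryrun itself, just a
hypothetical LD_PRELOAD shim; all names are made up):

  /* shim.c: count and time pthread_rwlock_rdlock calls in an unmodified
     application.
     Build:  gcc -shared -fPIC shim.c -o shim.so -ldl
     Run:    LD_PRELOAD=./shim.so ./app  */
  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <pthread.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  static int (*real_rdlock) (pthread_rwlock_t *);
  static _Atomic uint64_t calls;
  static _Atomic int64_t ns_total;

  int
  pthread_rwlock_rdlock (pthread_rwlock_t *lock)
  {
    if (real_rdlock == NULL)
      real_rdlock = (int (*) (pthread_rwlock_t *))
        dlsym (RTLD_NEXT, "pthread_rwlock_rdlock");
    struct timespec a, b;
    clock_gettime (CLOCK_MONOTONIC, &a);
    int r = real_rdlock (lock);
    clock_gettime (CLOCK_MONOTONIC, &b);
    calls++;
    ns_total += (int64_t) (b.tv_sec - a.tv_sec) * 1000000000
                + (b.tv_nsec - a.tv_nsec);
    return r;
  }

  /* Report totals when the traced process exits.  */
  static void __attribute__ ((destructor))
  report (void)
  {
    fprintf (stderr, "rdlock: %llu calls, %lld ns\n",
             (unsigned long long) calls, (long long) ns_total);
  }

The overhead of the two clock_gettime calls is significant relative to a
fast uncontended lock, so such numbers would be better treated as call
counts plus a rough latency picture than as precise measurements.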
> I think we will have multiple different workloads, and some architectures
> will even focus on particular aspects of performance.  So imho focusing
> on creating a workload generator, in the same way fio [1] does for IO,
> could let us create different kinds of synthetic synchronization workloads.
>
> [1] https://github.com/axboe/fio

A workload generator sounds really good.
But perhaps we can start with a common framework which makes it easy to
create a concrete benchmark without having to deal with setting up
threads, an alarm timer, and so on.
I've found an older post from Paul E. Murphy:
"[RFC] benchtest: Add locking microbenchmarks"
(https://sourceware.org/ml/libc-alpha/2015-12/msg00540.html)
"It runs
a workload for a fixed amount of time and counts the number
of iterations achieved (throughput). Test workloads (TFUNC)
are tested against each locking function (LFUNC)."

In the future, a workload generator could use those existing "LFUNC" /
"TFUNC"s to generate a specific workload.
Somebody could describe benchmarks where one thread uses a specific
L/TFUNC and other threads use different ones.  This could be useful for
creating benchmarks with rwlock_rdlock/rwlock_wrlock or lock/trylock.
>> Attached is what I've been playing with, which uses this for rwlock
>> benchmarking.

The struct shared_data_t contains pad arrays of 128 bytes.
Do you want to force the member rwlock into a separate cache line?
Then there should be a way for an architecture to specify the cache-line
size; e.g. on s390 the cache line size is 256 bytes.
Perhaps this information can be read at runtime via
sysconf(_SC_LEVEL1_DCACHE_LINESIZE), if available.

Could max_threads be determined dynamically, too?
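
Both could indeed be determined at runtime; a small sketch (assuming the
two sysconf names are available on the target, and with fallbacks since
sysconf can return -1):

  #include <unistd.h>

  /* Conservative defaults in case sysconf has no answer.  */
  static long cache_line_size = 128;
  static long max_threads = 4;

  static void
  init_params (void)
  {
    long linesz = sysconf (_SC_LEVEL1_DCACHE_LINESIZE);
    if (linesz > 0)
      cache_line_size = linesz;  /* eg, 256 on s390.  */

    long ncpus = sysconf (_SC_NPROCESSORS_ONLN);
    if (ncpus > 0)
      max_threads = ncpus;
  }

Note that the pad arrays in shared_data_t are sized at compile time, so a
runtime cache-line size would require either dynamic allocation or padding
to a build-time maximum.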

I saw the usage of __atomic_load_n and __atomic_store_n.
Shall we use the macros from include/atomic.h instead?

Would it make sense to call the synchronization function under test once
before timing, or to link with -z now?
Then we could omit the overhead of _dl_runtime_resolve when calling the
function for the first time.
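
For instance (sketch; start/stop and do_work are from the attached patch):

  /* Untimed warm-up: resolve the PLT entries of the functions under test
     so _dl_runtime_resolve is not measured in the first iteration.  */
  pthread_rwlock_rdlock (&shared_data.rwlock);
  pthread_rwlock_unlock (&shared_data.rwlock);

  TIMING_NOW (start);
  do_work (td);
  TIMING_NOW (stop);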
>> === Workloads
>>
>> As you can see in the attached patch, I'm using random accesses to both
>> a thread-shared and a thread-private memory region to simulate work done
>> between calls to synchronization primitives.  For example, in the rwlock
>> case, that means accesses to thread-private data between reader critical
>> sections and accesses to the shared data within critical sections.  So,
>> "work" is represented as invoking a Marsaglia-XOR-RNG followed by a
>> memory access, repeated a certain number of times.
>>
>> This aims at giving the memory system more to do than if we were just,
>> for example, spinning until a certain number of CPU cycles has passed.
>> Also, it allows one to introduce (cache) misses by varying the
>> parameters.
>
> I think as an initial synthetic workload it should give us some meaningful
> data.  Ideally I would like a way to select among more kinds of workloads,
> with or without different memory access patterns.
>
>> Is this a reasonable way of emulating real-world workloads for
>> concurrent code, from the perspective of your hardware?
>> Any suggestions for how to improve this?
>> Any suggestions regarding scenarios that should be covered?
>> I'm looking both for scenarios that can significantly affect
>> synchronization performance (eg, stressing the memory system more so
>> that cache misses are harder to hide performance-wise) and for
>> scenarios that are representative of common real-world workloads (eg,
>> should there be more dependencies, such as would arise when navigating
>> through linked data structures?).
>> What do you use in your own testing?
>
> I think we first should define exactly what kind of metric we are aiming
> for here.  Are we measuring synchronization latency for a specific
> implementation (semaphores, pthread r{d,w}lock) or just atomic operations?
> Are we aiming for high or low contention?  How many threads relative to
> the cpuset?  Critical sections that are CPU-intensive or memory-intensive?
> What about memory topology and its different latencies?
>
> That's why I think that by aiming to create a configurable workload
> generator we can then get different metrics, such as a workload with high
> thread creation that updates atomic variables, or a pool of threads with
> different thread set sizes.
>
>> I'd also be interested in seeing microbenchmarks that validate the
>> assumptions we make.  For example, it would be nice to show cases in
>> which the acquire hints on Power result in a performance gain, and where
>> not (so the rest of the glibc developers get a better idea where to use
>> the hint and where not to use it).
>>
>> A first target for improved benchmarking would be lock elision
>> performance, I believe.
>>
>> === Making it fit into the microbenchmark framework
>>
>> I'd like to hear about ideas and suggestions for how to best integrate
>> this into the microbenchmark suite.  The difference from many of the
>> current tests is that it's not sufficient to just run a function in a
>> tight loop.  We need to look at many more workloads (eg, long vs. short
>> critical sections, different numbers of threads, ...).
>>
>> That means we need to collect a lot more data, and present it in a
>> better way.  In the past (for another project), I've used a simple
>> mongodb database to store json objects representing benchmark results,
>> and then ran queries across it that fed into visualization (eg, using
>> gnuplot).  Is this something that somebody else is already working on?

Does this mean you could visualize the data of one benchmark run and plot
a graph of, e.g., long vs. short critical sections, and/or such graphs
across multiple benchmark runs (e.g. different glibc versions)?
> I do not have much experience with this kind of program, but wouldn't a
> simpler solution such as tinydb [1] suffice to store and query the required
> data?  I think that requiring a full db deployment to run these benchmarks
> is too complex.
>
> [1] https://pypi.python.org/pypi/tinydb
>
>> Doing that would mean collecting more information about when/where/... a
>> benchmark run happened.  Any thoughts on that?

Do you plan to collect this benchmark data on a public server or on
somebody's local machine?
> I think collecting the expected default information (kernel, compiler,
> cpu) plus the topology from lstopo should suffice.

Perhaps we could add the commit ID of the glibc build that was used.
>> It would also be useful if benchmarks were easier to run manually so
>> that developers (and users) can experiment more easily with different
>> workloads.  However, that would mean that we would need to be able to
>> set benchmark parameters when invoking a benchmark, and not just be able
>> to produce a benchmark binary that has them baked in.  Any thoughts or
>> preferences on how best to do that?
>
> I think configurable workloads in the form of either an ini or a json
> file should be good enough.

Patch

commit 7554bd4a042d6dda0ec8d34224682c4b9124a8bd
Author: Torvald Riegel <triegel@redhat.com>
Date:   Tue Jan 10 13:27:44 2017 +0100

    Add synchronization microbenchmark.

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 81edf8a..60aef3d 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -25,7 +25,7 @@  bench-math := acos acosh asin asinh atan atanh cos cosh exp exp2 log log2 \
 	      modf pow rint sin sincos sinh sqrt tan tanh fmin fmax fminf \
 	      fmaxf
 
-bench-pthread := pthread_once
+bench-pthread := pthread_once multithread
 
 bench-string := ffs ffsll
 
diff --git a/benchtests/bench-multithread.c b/benchtests/bench-multithread.c
new file mode 100644
index 0000000..633deb0
--- /dev/null
+++ b/benchtests/bench-multithread.c
@@ -0,0 +1,320 @@ 
+/* Benchmark multithreaded operations.
+   Copyright (C) 2013-2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define __USE_POSIX
+#include <errno.h>
+#include <math.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <unistd.h>
+#include <semaphore.h>
+#include <stdbool.h>
+#include <stdint.h>
+
+#include "bench-timing.h"
+//#include "json-lib.h"
+
+/* Benchmark parameters.  */
+static const int duration = 5;
+static const int max_threads = 4;
+/* Must be power-of-two values.  */
+static const size_t mem_shared_kb = 1;
+static const size_t mem_private_mb = 16;
+
+/* Thread data.  */
+typedef struct {
+  bool primary;
+  sem_t work_ready;
+  size_t mem_private_elems;
+  unsigned int *mem_private;
+  size_t mem_shared_elems;
+  unsigned int *mem_shared;
+  long int iter;
+  uint32_t rng_state;
+  bool work_done;
+} thread_data_t;
+
+/* Shared data.  */
+typedef struct {
+  size_t mem_elems;
+  unsigned int *mem;
+  char pad1[128];
+  pthread_rwlock_t rwlock;
+  char pad2[128];
+} shared_data_t;
+
+shared_data_t shared_data;
+sem_t threads_ready;
+bool finished;
+
+
+typedef struct {
+  pthread_t thread;
+  thread_data_t* td;
+} thread_handle_t;
+thread_handle_t* threads;
+
+void* xmalloc (size_t s)
+{
+  void *ptr = malloc (s);
+  if (ptr == NULL)
+    {
+      fprintf (stderr, "malloc failed\n");
+      exit (1);
+    }
+  return ptr;
+}
+
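+/* Marsaglia xorshift32 PRNG; the state must be nonzero.  */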
+static uint32_t pseudo_rng (uint32_t *state)
+{
+  *state ^= *state << 13;
+  *state ^= *state >> 17;
+  *state ^= *state << 5;
+  return *state;
+}
+
+static void shared_work_start (void)
+{
+  pthread_rwlockattr_t attr;
+  pthread_rwlockattr_init (&attr);
+  pthread_rwlockattr_setpshared (&attr, PTHREAD_PROCESS_PRIVATE);
+  pthread_rwlockattr_setkind_np (&attr, PTHREAD_RWLOCK_PREFER_READER_NP);
+  pthread_rwlock_init (&shared_data.rwlock, &attr);
+  pthread_rwlockattr_destroy (&attr);
+}
+
+static void shared_work_stop (void)
+{
+  pthread_rwlock_destroy (&shared_data.rwlock);
+}
+
+/* If writes is 0, reads_per_write reads are run once nonetheless.  */
+static void access_mem (unsigned int *mem, size_t size,
+    size_t writes, size_t reads_per_write, uint32_t *rng)
+{
+  do
+    {
+      unsigned int sum = 0;
+      for (int j = 0; j < reads_per_write; j++)
+	{
+	  size_t idx = pseudo_rng (rng) & (size - 1);
+	  sum += __atomic_load_n (mem + idx, __ATOMIC_RELAXED);
+	}
+      if (writes > 0)
+	{
+	  size_t idx = pseudo_rng (rng) & (size - 1);
+	  __atomic_store_n (mem + idx, sum, __ATOMIC_RELAXED);
+	  writes--;
+	}
+    }
+  while (writes > 0);
+}
+
+static void do_work (thread_data_t *td)
+{
+  while (!__atomic_load_n (&td->work_done, __ATOMIC_RELAXED))
+    {
+      access_mem (td->mem_private, td->mem_private_elems, 0, 60,
+	  &td->rng_state);
+
+      pthread_rwlock_rdlock (&shared_data.rwlock);
+      access_mem (td->mem_shared, td->mem_shared_elems, 3, 20,
+	  &td->rng_state);
+      pthread_rwlock_unlock (&shared_data.rwlock);
+
+      td->iter++;
+    }
+}
+
+static void* thread_func (void* tdptr)
+{
+  /* Allocate data in the thread so that we use local memory.  */
+  thread_data_t *td = xmalloc (sizeof (thread_data_t));
+  *((thread_data_t**) tdptr) = td;
+  td->primary = (&threads[0].td == tdptr);
+  sem_init (&td->work_ready, 0, 0);
+  td->mem_private_elems = mem_private_mb * 1024 * 1024
+      / sizeof (*td->mem_private);
+  td->mem_private = xmalloc (sizeof (*td->mem_private)
+      * td->mem_private_elems);
+  td->mem_shared_elems = shared_data.mem_elems;
+  td->mem_shared = shared_data.mem;
+  td->rng_state = 1692732839;
+
+  if (td->primary)
+    {
+      /* Wait for all threads to complete initialization.  */
+      for (int i = 1; i < max_threads; i++)
+	while (sem_wait (&threads_ready) != 0) ;
+
+      /* Run all the work.  */
+      for (int t = 1; t <= max_threads; t++)
+	{
+	  // TODO update work description
+	  shared_work_start ();
+	  timing_t start, stop, real_duration;
+
+	  /* Tell participating threads to start and start our own work.  */
+	  TIMING_NOW (start);
+	  for (int i = 1; i < t; i++)
+	    {
+	      threads[i].td->work_done = false;
+	      sem_post (&threads[i].td->work_ready);
+	    }
+
+	  /* Start the alarm and our own work.  */
+	  td->work_done = false;
+	  td->iter = 0;
+	  alarm (duration);
+	  do_work (td);
+	  TIMING_NOW (stop);
+
+	  /* Wait till we're done, and gather results.  */
+	  long int iter = td->iter;
+	  for (int i = 1; i < t; i++)
+	    {
+	      while (sem_wait (&threads_ready) != 0) ;
+	      iter += threads[i].td->iter;
+	    }
+
+	  TIMING_DIFF (real_duration, start, stop);
+	  printf("threads=%d  iter/s=% 15ld  timing_t=%.0lf\n", t,
+	      iter / duration, (double)real_duration);
+
+	  shared_work_stop ();
+	}
+      /* We're done.  Tell all others.  */
+      finished = true;
+      for (int i = 1; i < max_threads; i++)
+	sem_post (&threads[i].td->work_ready);
+    }
+  else
+    {
+      /* We are ready.  Do what we're told to do.  */
+      while (1)
+	{
+	  sem_post (&threads_ready);
+	  while (sem_wait (&td->work_ready) != 0) ;
+	  if (finished)
+	    break;
+	  td->iter = 0;
+	  do_work (td);
+	}
+    }
+  return NULL;
+}
+
+static void setup_threads (void)
+{
+  threads = xmalloc (sizeof (thread_handle_t) * max_threads);
+  sem_init (&threads_ready, 0, 0);
+  threads[0].thread = pthread_self ();
+  for (int i = 1; i < max_threads; i++)
+    {
+      if (pthread_create (&threads[i].thread, NULL, thread_func,
+	  &threads[i].td) != 0)
+	{
+	  fprintf (stderr, "pthread_create failed\n");
+	  exit (1);
+	}
+    }
+}
+
+
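+/* SIGALRM handler: tell all worker threads to stop working.  */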
+static void
+alarm_handler (int signum)
+{
+  for (int i = 0; i < max_threads; i++)
+    __atomic_store_n (&threads[i].td->work_done, true, __ATOMIC_RELAXED);
+}
+
+int
+main (int argc, char **argv)
+{
+//  json_ctx_t json_ctx;
+  struct sigaction act;
+
+  if (!__atomic_always_lock_free (sizeof (unsigned int), 0))
+    {
+      printf ("no fast atomics available\n");
+      exit (1);
+    }
+  if (((mem_shared_kb & (mem_shared_kb - 1)) != 0)
+      || ((mem_private_mb & (mem_private_mb - 1)) != 0))
+    {
+      printf ("mem sizes are not power-of-two values\n");
+      exit (1);
+    }
+
+//  json_init (&json_ctx, 0, stdout);
+//
+//  json_document_begin (&json_ctx);
+//
+//  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+//
+//  json_attr_object_begin (&json_ctx, "functions");
+//
+//  json_attr_object_begin (&json_ctx, "malloc");
+//
+//  json_attr_object_begin (&json_ctx, "");
+
+
+  memset (&act, 0, sizeof (act));
+  act.sa_handler = &alarm_handler;
+
+  sigaction (SIGALRM, &act, NULL);
+
+  // TODO what's res?
+  unsigned long res;
+  TIMING_INIT (res);
+  (void) res;
+
+  shared_data.mem_elems = mem_shared_kb * 1024 / sizeof (*shared_data.mem);
+  shared_data.mem = xmalloc (sizeof (*shared_data.mem)
+      * shared_data.mem_elems);
+
+  setup_threads();
+  thread_func (&threads[0].td);
+
+//  json_attr_double (&json_ctx, "duration", d_total_s);
+//  json_attr_double (&json_ctx, "iterations", d_total_i);
+//  json_attr_double (&json_ctx, "time_per_iteration", d_total_s / d_total_i);
+//  json_attr_double (&json_ctx, "max_rss", usage.ru_maxrss);
+//
+//  json_attr_double (&json_ctx, "threads", num_threads);
+//  json_attr_double (&json_ctx, "min_size", MIN_ALLOCATION_SIZE);
+//  json_attr_double (&json_ctx, "max_size", MAX_ALLOCATION_SIZE);
+//  json_attr_double (&json_ctx, "random_seed", RAND_SEED);
+//
+//  json_attr_object_end (&json_ctx);
+//
+//  json_attr_object_end (&json_ctx);
+//
+//  json_attr_object_end (&json_ctx);
+//
+//  json_document_end (&json_ctx);
+
+  return 0;
+}