Delft-Hydra 2004 API Specification
version 4.0
Irv Elshoff @ WL | Delft Hydraulics
21 jun 04
Introduction
The Hydra executive provides parallelism, synchronization and
message passing for domain decomposition (DD) and coupling in
Delft3D. This document describes the semantics and C++ bindings
of Hydra for Delft3D developers.
Hydra supports a collection of interacting objects called iterators. Each iterator
repeatedly does some work (e.g., computation or I/O) and then exchanges
information with a fixed set of neighboring iterators. They continue
doing this until they all decide to stop. (A single iterator can
force termination if an error occurs.) The work an iterator does
is specified by its iterator function, which is supplied by the Delft3D
developer.
Iterators are classified according to some essential common property
and organized into categories.
The neighbors of an iterator must not belong to the same category as the
iterator itself.
Consider the following undirected multipartite graph:
The blue nodes could represent hydrodynamic flow processes (one for
each subdomain), the red nodes DD mappers and the yellow node an
online visualization process that renders a coherent view of the entire
domain to an end user. Interactions between the flow processes
and visualizer could occur every simulation time step, whereas
interactions between flow processes and mappers could be more frequent
within a numerical solver loop.
All of the iterators in a given category can run in parallel. At
any moment in time, iterators of only one category are executing.
Because not all iterators can run simultaneously and the number of
iterators in a category may exceed the number of physical processors
available, Hydra provides clusters
to group iterators together on a single processor so that efficient use
is made of the underlying hardware. Clusters also define a common
address space so that shared memory can be used for communication.
Communication between clusters is done by message passing.
Messages are stored in blobs,
which are opaque to Hydra.
Execution Overview
Hydra goes through four phases during execution:
- Configuration: Instantiation of categories, clusters and
iterators.
- Initialization: Sequential execution of the iterator
functions up to a certain point.
- Simulation: Parallel execution of iterators with message
exchange. This is the main phase of the program.
- Finalization: Available for non-Hydra wrapup.
Hydra starts out as a single thread. The Hydra executive must
be bootstrapped before any other operations with:
void
Hydra::Boot (
    void (*configfunction) (int argc, char * argv[]),
    void (*finalfunction) (char * reason)
);
The arguments to Boot are pointers to two user-defined
functions. The first function should configure Hydra by creating
instances of iterators, categories and clusters. It is given a
copy of the command-line arguments used to start the program.
Some parameters may be relevant (e.g., the name of a configuration
file). Unrecognized parameters should be ignored; they may be
meaningful to Hydra itself or other components of the program.
The second function is called after the simulation phase completes,
and can be used to clean up loose ends (e.g., release allocated
resources). Normally, it is passed a NULL
pointer, unless the program has aborted, in which case it is passed a
pointer to a character string describing the reason for the abort (see
below). The finalization function is optional; NULL
may be passed to Boot if it is not present.
The configuration and finalization functions will run in the same
address space. They do not, however, have access to the address
spaces of the iterators.
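As a sketch of how Boot ties these pieces together (the function names MyConfig and MyFinal are illustrative, not part of the API):

```cpp
#include <stdio.h>
#include "hydra.h"

// Hypothetical configuration function: creates the categories,
// clusters and iterators described in the sections below.
static void MyConfig (int argc, char * argv[]) {
    // ... instantiate categories, clusters and iterators ...
}

// Hypothetical finalization function: reason is NULL on normal
// termination, or a description of the abort otherwise.
static void MyFinal (char * reason) {
    if (reason != NULL)
        fprintf (stderr, "Aborted: %s\n", reason);
}

int main (int argc, char * argv[]) {
    Hydra::Boot (MyConfig, MyFinal);
    return 0;
}
```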
After the configuration function returns, Hydra will sequentially
invoke all iterator functions within their
clusters. They can initialize their
internal states, make Hydra calls to determine the configuration
topology, and prepare for simulation. Code in the initialization
phase cannot use pointers to objects created in the configuration phase
because different address spaces are used. When their
initialization has been completed, iterator functions must call:
void
Hydra::Ready (void);
When all iterator functions have called Ready, the initialization
phase is concluded and the simulation proper is started. In this phase
iterators will be allowed to run in
parallel, and the simulation proceeds until all iterators have returned
from their function.
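A minimal iterator function might therefore have the following shape (the function name, loop bound and termination condition are assumptions for the sake of illustration):

```cpp
#include "hydra.h"

// Illustrative iterator function; supplied to the Iterator constructor
// during the configuration phase (see Iterators below).
void FlowFunction (Hydra::Iterator * self, char * name, Hydra::Blob * configblob) {
    // Initialization phase: runs sequentially; inspect the topology,
    // allocate internal state, etc.

    Hydra::Ready ();            // conclude initialization

    // Simulation phase: runs in parallel with other iterators.
    for (int step = 0; step < 1000; step++) {
        // compute, then exchange messages with neighbors (see Messages)
    }
    // Returning terminates this iterator; when all iterators have
    // returned, the simulation phase ends.
}
```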
If an error occurs during any phase of the program that
makes
continuing on pointless, the entire program can be terminated
immediately with:
void
Hydra::Abort (
    char * reason,
    ...
);
Reason is a character string that will be passed to the finalization
function and
can be used to describe why the simulation was aborted. Reason
may be a printf format string with additional arguments, in which case
the string will be rendered before being passed to the finalization
function.
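For example, an iterator that cannot open an input file might abort the whole run (the filename variable is illustrative):

```cpp
// reason is a printf-style format string; it is rendered before being
// passed to the finalization function.
Hydra::Abort ("Cannot open input file \"%s\"", filename);
```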
During the simulation phase iterators synchronize and exchange
information using Send and Receive calls. The general execution
paradigm used in Delft3D is that an iterator repeatedly sends messages
to all its neighbors (in a certain category) and then receives messages
from all neighbors. An iterator is effectively blocked until all its
neighbors have signalled it to continue. This causes a systolic
execution pattern.
At the very end, when all iterators have terminated, the
finalization function passed to Boot will be invoked. The
reason argument is that supplied to Abort, or NULL for normal termination.
Names and IDs
Hydra is object oriented. Three object classes - categories,
clusters and iterators - have three different kinds of identifiers:
- A character string name for user-oriented output (max length is Hydra::MAXSTRING).
- An integer ID for comparison purposes.
- A C++ pointer for local invocation. Local means within the
same program phase and cluster.
Names and IDs are global and available in all phases of the
program. All three classes have methods to retrieve the name and
ID given a pointer:
char * name = myobject->Name (void);
int id = myobject->ID (void);
An ID is a nonnegative integer less than Hydra::MAXCATEGORIES, Hydra::MAXCLUSTERS and Hydra::MAXITERATORS,
respectively.
Blobs are anonymous and do not have these methods.
Categories
A category defines a set of objects that have something significant in
common. For example, in Delft3D, flow processes, mappers and
barriers
are all categories.
A category is created with:
Category * mycat = new Hydra::Category (
    int * id,
    const char * name
);
The ID argument is a pointer to an integer that will be set to the ID of
the new category. If the pointer is NULL, no
assignment will be done.
Given a category name, a pointer to its object instance can be retrieved
with:
Category * mycat = Hydra::LookupCategory (
    const char * name
);
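A sketch of creating a category in the configuration phase and looking it up again later (the name "flow" is an arbitrary example):

```cpp
#include "hydra.h"

// Configuration phase: create the category and remember its ID.
int flowID;
Hydra::Category * flow = new Hydra::Category (&flowID, "flow");

// Later (possibly in another phase, where the old pointer is no
// longer valid), retrieve the category by name:
Hydra::Category * flowAgain = Hydra::LookupCategory ("flow");
```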
Blobs
Besides synchronization, Hydra provides a means of exchanging
data. The organization and content of the data is
arbitrary and opaque to Hydra. To Hydra, data is nothing more
than a contiguous block of
memory of a certain size, and is called a blob. Hydra passes
copies of the data around as appropriate. Blobs are used in
iterator initialization and for message passing.
To create a blob, call:
Blob * myblob = new Hydra::Blob (
    const void * address,
    const unsigned int size
);
This will create a small descriptor containing a copy of the address
and size (in bytes). The data itself is not copied or otherwise
accessed. The address should point to statically or dynamically
allocated memory that remains valid for the lifetime of the blob.
If the blob is used as an argument for Send or Receive (see Messages
below) the data referred to in the descriptor is copied as appropriate
(from the address for Send, to the address for Receive).
Blobs provide two methods to retrieve the instantiation parameters:
void * address = blob->Address (void);
unsigned int size = blob->Size (void);
Address returns a pointer to the blob data. Size returns its
length in bytes.
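A sketch of typical blob usage, assuming a hypothetical configuration struct; only the descriptor is created here, the data itself stays in place:

```cpp
#include "hydra.h"

// Hypothetical per-iterator configuration data.
struct FlowConfig {
    int    timesteps;
    double dt;
};

static FlowConfig config = { 1000, 0.5 };

// Wrap the data in a blob descriptor; the memory must remain valid
// for the lifetime of the blob.
Hydra::Blob * blob = new Hydra::Blob (&config, sizeof config);

// blob->Address () returns &config; blob->Size () returns sizeof config.
```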
Iterators
Configuration Phase
An iterator is an envelope for an active object. Iterators are
created in the configuration phase (only) with:
Iterator * myiter = new Hydra::Iterator (
    int * id,
    char * name,
    Blob * configblob,
    Category * category,
    void (*function) (
        Iterator * self,
        char * name,
        Blob * configblob
    ),
    unsigned int weight = 0
);
The ID argument is a pointer to an integer that will be set to the ID of
the new iterator. If the pointer is NULL, no
assignment will be done.
Iterators can be parameterized by means of a
blob, which could contain a string or a complex data structure.
If parameterization is not necessary, a NULL pointer
should be given.
Iterators have weights, which are used to
automatically cluster iterators for load balancing (see Clusters).
The iterator function is invoked by Hydra to carry out the actual
work. The self argument is a pointer to the iterator object for
various Hydra calls described below. The name and configuration
blob parameters are copies of the arguments to the object instantiation
call.
Iterators are joined with:
void
Hydra::Join (
Iterator * iter1,
Iterator * iter2,
unsigned int affinity = 0
);
Join creates a symmetric relationship between two iterators.
Joining A and B is the same as joining B and A. Join is
idempotent, so doing both joins would result in a single relationship
(and no error or warning).
The categories of both iterators must
be different. Iterators that belong to the same category cannot
be joined.
The affinity is used for load balancing, which is
described later. If two iterators are joined more than once with
varying affinities, the last affinity given will be used.
Hydra runs until all iterators have terminated. Sometimes it is
convenient to write iterators that receive messages an
indefinite number of times. To avoid requiring user code to shut
down all iterators for global termination, an iterator may "detach"
itself so that its termination is not required for global
termination. Iterators can be detached with:
void iterator->Detach (void);
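Putting the configuration calls together, a sketch of a small DD setup (all names, weights and affinities are invented for the example; FlowFunction and MapperFunction stand for user-supplied iterator functions):

```cpp
#include "hydra.h"

void MyConfig (int argc, char * argv[]) {
    Hydra::Category * flow   = new Hydra::Category (NULL, "flow");
    Hydra::Category * mapper = new Hydra::Category (NULL, "mapper");

    // Two flow subdomains with different computational weights.
    Hydra::Iterator * f1 = new Hydra::Iterator (NULL, "flow-1", NULL, flow, FlowFunction, 10);
    Hydra::Iterator * f2 = new Hydra::Iterator (NULL, "flow-2", NULL, flow, FlowFunction, 20);

    // One mapper connecting the two subdomains (different category).
    Hydra::Iterator * m = new Hydra::Iterator (NULL, "map-1", NULL, mapper, MapperFunction);

    // Joins are symmetric; affinities guide load balancing.
    Hydra::Join (f1, m, 5);
    Hydra::Join (f2, m, 5);
}
```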
Initialization and Simulation Phases
An iterator function can call various Hydra functions to get
information about the topology and attributes of other iterators.
Since the topology is static, these calls will always return the same
information; the return values can be remembered and used later.
Iterator * self = IteratorSelf (void);
Returns an object pointer for the currently executing iterator.
Blob * configblob = iterator->ConfigBlob (void);
Returns a reference to the configuration blob of an iterator.
Category * cat = iterator->Category (void);
Returns the category of any iterator.
unsigned int count = iterator->NeighborCount (Category * cat = NULL);
Returns the number of neighbors in a given category or all categories
(no argument) of an iterator.
void iterator->RewindNeighbors (Category * cat = NULL);
Iterator * neighbor = iterator->NextNeighbor (Category * cat = NULL);
These functions allow looping through a list of neighboring
iterators. The Rewind function resets the list. The Next
function fetches the next iterator on the list. The order is
arbitrary but consistent. When the list is exhausted NULL
is returned. Like in the Count function, the category can be
omitted to get all neighbors regardless of their category. This
is
convenient when an iterator is known to have only one kind of
neighbor. If a
category is specified, the pointer should be the result of a
LookupCategory done in the same program phase.
The Rewind function must be called at least once before Next with the
same argument. Rewind will print a warning if the iterator does
not have any neighbors of the requested category. This can be
avoided by ensuring that the return value of Count is not zero.
Hydra provides an iterator-specific value that can be set and retrieved
with:
void iterator->SetValue (void * value);
void * value = iterator->GetValue (void);
The value can be considered a global variable for all functions called
from the iterator function. Each iterator will have its own
actual value. If more than one value per iterator is required, an
aggregate data structure can be created and a pointer to it used as the
single iterator value.
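A sketch of how these calls combine during initialization (the category name "mapper" is an example):

```cpp
#include "hydra.h"

Hydra::Iterator * self = Hydra::IteratorSelf ();
Hydra::Category * mappers = Hydra::LookupCategory ("mapper");

// Guard with NeighborCount to avoid the Rewind warning on an
// empty neighbor list.
if (self->NeighborCount (mappers) > 0) {
    self->RewindNeighbors (mappers);    // always rewind before Next
    while (Hydra::Iterator * neigh = self->NextNeighbor (mappers)) {
        // record neigh for use in the simulation phase; the topology
        // is static, so the return values remain valid
    }
}
```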
Messages
Hydra iterators can exchange messages with immediate neighbors.
This is done with:
void neighbor->Send (Blob * message);
This function copies the blob to a queue of messages for the specified
iterator and returns. The blob can be deleted or reused for an
identically sized message immediately after Send returns.
void neighbor->Receive (Blob * message);
This gets the oldest message on the queue sent from the specified
neighbor. There is no wildcard; the neighbor must be explicitly
mentioned. If the queue is empty, Receive will block until a
message is sent by the neighbor. The blob should be large enough to
accommodate the message (otherwise the program will abort). The
maximum blob size is just short of 4 GB (a 32-bit size
limitation).
Care should be taken to avoid deadlock. If iterators fit the
following paradigm
    Blob * out = new Hydra::Blob (abc, sizeof abc);    // abc: outgoing data
    Blob * in  = new Hydra::Blob (xyz, sizeof xyz);    // xyz: incoming data

    while (! finished) {
        // do useful work of some kind involving data abc and xyz

        // send messages to all neighbors
        self->RewindNeighbors ();
        while (Iterator * neigh = self->NextNeighbor ())
            neigh->Send (out);

        // get messages from all neighbors
        self->RewindNeighbors ();
        while (Iterator * neigh = self->NextNeighbor ())
            neigh->Receive (in);
    }
deadlock cannot occur. More complicated forms of communication
are supported by Hydra, but without deadlock detection or avoidance.
Clusters
Load Balancing
Iterators are active objects implemented as
threads that do computations. On a shared memory multiprocessor
(e.g., SGI Origin) the operating system dynamically assigns threads to
CPU's to achieve the best overall load balance (including other users'
programs). On a distributed memory machine (e.g., Linux cluster)
threads do not migrate. How iterator threads are
placed on the physical nodes can affect performance. One wants
maximal parallelism by having many nodes, but also wants to attain
good utilization of the hardware resources. Simply placing every
iterator on its own node is wasteful because they alternate execution
based on categories, so many nodes would be idle at any given
time. Also, placing related iterators on one node cuts down on
the overhead for communication since the physical network is not
involved.
How to optimize the placement of Delft3D iterators on computational
nodes
is a non-trivial problem. The topology of the domain and the size
and complexity of the subdomains can be used to determine a good
placement. The width of the interfaces between subdomains is
related to the amount of communication between objects. This is
called the "affinity" of one iterator to another, and is a
measure of how desirable it is to keep two iterators on the same
node. The number of grid cells and types of the physical
processes simulated can be used to estimate the computational
requirement of an iterator, which is called its
"weight".
Affinities
and weights are implied by the Delft3D input, and are independent on
the underlying hardware configuration.
Given a set of affinities and weights and the number of nodes, a
clustering algorithm can be used to place iterators on nodes. The
algorithm is simpler if the nodes are identical and communication
costs the same between any pair of nodes. The WL
Linux cluster "hydrax" conforms to this model.
Weights and affinities are non-negative integers
and assumed to be on some common linear scale. They are specified
as
optional arguments to the iterator
constructor and join method. Hydra does not attach any meaning to
the values themselves.
The general heuristic employed by the placement algorithm is to spread
the processes over all clusters as evenly as possible based on their
weights. Since this is equivalent to the bin packing problem,
which is NP-complete, the goal is a reasonable distribution, not an
optimal one. Next, each mapper is placed in the same cluster as
the subdomain with which it has the greatest affinity. If,
however, too many mappers aggregate around a heavy process, some will
be moved to the less busy cluster of their other process.
Barriers are all left on the main cluster.
Automatic placement is done only if Place is never called. If no
clusters have been created, Hydra will create one for every node of the
underlying hardware before doing automatic placement.
Automatic placement based on weights and affinities is not yet
implemented. Manual placement is.
API Routines
A Hydra cluster is an abstraction of a physical node in a distributed
memory computer. Clusters contain a collection of zero or more
iterator objects. The iterators in a cluster share
a common address space. There is always one cluster (called the
"main" cluster) present. Code for the configuration,
initialization and finalization phases runs in the main cluster, which
is also the default cluster for all iterators.
Additional clusters can be created with:
Cluster * cluster = new Hydra::Cluster (
    int * id,
    const char * name
);
Iterator objects are placed in a cluster with:
void iterator->Place (Cluster * cluster);
Clusters should be created and populated in the configuration phase of
Hydra execution.
Iterators can determine in which cluster they belong with:
Cluster * cluster = iterator->Cluster (void);
Two objects in the same cluster can share memory. If iterators
have the same cluster ID (and hence name) they can share the same
address space and C pointers can be used to communicate.
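A sketch of manual placement in the configuration phase (cluster and iterator names are invented; myiter stands for an iterator created earlier):

```cpp
#include "hydra.h"

// Create a cluster for one physical node and place an iterator on it.
int nodeID;
Hydra::Cluster * node1 = new Hydra::Cluster (&nodeID, "node-1");

myiter->Place (node1);    // myiter: an Iterator created earlier

// During initialization or simulation an iterator can ask where it
// lives, e.g. to decide whether shared memory can be used:
// Hydra::Cluster * mine = Hydra::IteratorSelf ()->Cluster ();
```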
Debug Facility
Multithreaded programs are notoriously difficult to debug.
Hydra provides a trace mechanism to help locate problems. The
amount of output is determined by the notification level. Each
successive level adds more detail.
- 0 = Nothing.
- 1 = Major program phases and the invocation and
termination of iterators.
- 2 = Configuration information and message events.
- 3 = Everything except...
- 4 = Low-level MPI and Pthreads blocks and unblocks
The debug level can be set with:
void
Hydra::SetDebugLevel (int level);
If done in the configuration phase it applies globally.
Thereafter it only affects the host on which the iterator is executing.
[ToDo: describe output]
Developer Notes
API definitions are contained in include file "hydra.h". This
file also contains non-API definitions, below the line "No API Beyond This Point"; they
should not be used outside Hydra itself.
The namespace Hydra
can be used to abbreviate names.
Hydra is available in the library "libhydra.a".
The implementation uses fixed sized buffers for various purposes, e.g.,
strings, configuration blobs, etc. The limits are printed in
debug level 3 (not yet implemented).
IP sockets are used for communication in Hydra. Each iterator to
iterator connection uses one socket on each host. Linux typically
supports 1000 sockets (or more) per process.
The End.