Messages are sent with a fixed length message header, that specifices
message tag, source, count, etc.  A separate socket is used for
each of the possible n^2 communication patterns among n MPI processes.
Initially, only sockets between master and each slave are set up
(see below for more details).
Messages are buffered in the socket buffer as much as possible.
The primary current exception is when two messages with the same source
and destination, but different tags, are sent.  If the receiver specifies
a tag to read the second message first, then the receiver must extract
all messages from the buffer, and internally buffer them, until it finds
the desired one.

CALL_CHK is a macro defined to automatically check for success of
system calls, instead of having to hand-write the logic everywhere.
(It checks for -1, which is assumed to be failure.)  It also checks if
the cause of failure was EINTR (a signal), and if so, it re-starts the
system call.

SETSOCKOPT(sd) is another macro.  Currently, it sets the sockets
to SO_KEEPALIVE.  Sockets are NOT set to non-blocking.  Instead, select()
is used to avoid blocking on a recv() for which there is nothing ready.
This call to SETSOCKOPT() is currently done only after socket() calls,
but not after accept().  This works, since each connection is created
by a socket() call at one end, and an accept() call at the other end,
and it suffices for one end to keep the connection alive.  (Doing select()
on exceptional events would be an alternative to SO_KEEPALIVE, that would
not trigger SIGPIPE events for broken sockets.  This would be more robust.)
The sockets are blocking (default) and so send will either succeed (locally)
or return an error.  One must test the number of characters seen by recv(),
to verify that it received the entire message expected.
Some implementations, such as MPICH, set TCP_NODELAY so that small messages
are not delayed in being sent out.  In keeping with a minimalist
implementation, usually across Ethernet, the possible performance improvements
were not considered attractive enough.

Although there are no architecture conditionalizatins, things like
 "#if 0 ... #else .. . #endif" are used.  In these cases, one version of
the code is preferable, but there are potential incompatibilities with some of
the UNIX dialects.

Global variables are allocated near the beginning of mpi.c .
The main communication table is pg_array.  The index into the array of structs
is the rank of the process.  Most fields should be obvious.  For example,
the sd field is the socket descriptor for communication with the process
of that rank.  Initially, sockets are set up only between the original
process (rank 0) and each slave process.  In addition, each process
sets up a listener port.  The AF_INET address of that port is stored
in the listener_addr field of each pg_array entry.

At startup (MPI_Init), the socket connection between the original
process and slave is set up along with a listener port for the slave.
The slave then informs the master of the address of its listener port
in its first message (using send/recv).  The master then replies by
telling the slave his rank and the total number of slaves, in a message
in the format (struct init_msg).  The master then sends to each slave
the addresses of all slave listener ports.  Thus, the first time that
a slave wishes to communicate with another slave, the first slave
connects with the listener port of the second, in order to create
a new socket between the two slaves.

In this implementation, sockets are not set to SO_KEEPALIVE.
So, connected sockets can time out.  Any write to a connected socket
is tested with select before the write of the message header.  If a socket
is no longer alive, the old one is closed (at each end), and a new one
is connected.  Note that this implies that one must also test
a socket with select() before recv(), which would block.  Further
the call to select must include the listener port (MPINU_my_list_sd)
in case the original socket is dead and the sender needs to connect
a new socket through the listener port.

The other major data structure is (struct msg_hdr).  Before any message is
sent, this header is sent advertising the rank of the source, the tag of the
message, and the size of the data (and eventually the communicator when
communicators other than MPI_COMM_WORLD are implementated).

Note that there is a COLL_COMM_TAG which is not valid for ordinary users.
Collective communications can be emulated by ordinary point-to-point
communications, in which case the COLL_COMM_TAG is used to distinguish
such messages.  (In fact, there should be several COLL_COMM_TAG's,
one for each type of collective communication operation.)

=======================

It would be possible to have second process for communication that would
buffer all incoming processes, and thus also allow greater overlap of
communication and computation.  This is not currently done.

Another concern for future work is that when socket buffers fill up, a sending
process will block.  So, send will block on large messages, creating
the possibility of deadlock if two processes send each other large
messages before calling recv().  This is not currently addressed.

The plan to fix deadlock is to check if there are pending messages from
any process of lower rank before doing a send.  Any such pending messages
are received and stored in a buffer.  Similarly, if the send is
to the same process, then the message can be directly transferred
to the internal buffer without using any socket.  This avoids deadlock,
and it does not require the use of the buffers in the common case
in which all communication uses blocking and is between the master (original
process) and slaves. 

A moderate size internal buffer for messages is allocated in advance,
so that malloc() will be
needed only to store large messages.  In order to easily optimize
buffer manageent for a FIFO message queue, the internal buffer can have
its memory compressed whenever a message other than the first or
last one is read.  Further, when the last message threatens to overflow
the buffer, all of memory can be moved forward.

Addressing this issue of blocking could be handled by doing the following
before doing a new MPI_Send.  One can check the size field of incoming message
headers.  Any small messages can be read into a pre-allocated buffer, and one
can then test that there are no outstanding large messages capable of causing
deadlock.  If there are outstanding large messages, those would have to be
read into a large buffer, using malloc().  Thus, currently, malloc() need only
be used if messages are read out of order (for example by matching the tag of
a later message).  In this modification, any code requiring an MPI_Send while
a large incoming message is pending would also require malloc().
