12. High Availability Extensions

While GM automatically handles transient network errors such as dropped, corrupted, or misrouted packets, and while the GM mapper automatically reconfigures the network if links or nodes appear or disappear, GM cannot automatically handle catastrophic errors such as crashed hosts or loss of network connectivity without the cooperation of the client program.

When GM detects a catastrophic error, it temporarily disables the delivery of all messages with the same sender port, target port, and priority as the message that experienced the error. GM informs the client of the error by passing a status other than GM_SUCCESS to the client's send completion callback routine. The client program is then expected to call either gm_resume_sending() or gm_drop_sends(), either of which reenables the delivery of messages with the same sender port, target port, and priority. This mechanism preserves message order over the prioritized connection between the sending and receiving ports, while allowing the client to decide whether the other packets that it has already enqueued over the same connection should be transmitted or dropped.

Simpler GM programs, such as MPI programs, will typically consider GM send errors to be fatal and will exit when they see a send error. This is reasonable for applications running on small or physically robust clusters where errors are rare and where users can tolerate restarting jobs in the rare event of a network error. Poorly written GM programs may simply ignore the error codes, which will cause the program to eventually hang with no error indication when catastrophic errors are encountered. This poor programming practice is strongly discouraged: developers should always check the send completion status. More sophisticated applications, such as high availability database applications, will respond to network faults, which appear to the client as send completion status codes other than GM_SUCCESS.
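The following is a minimal sketch of checking the completion status rather than ignoring it. So that it stays self-contained, it does not include gm.h: the names demo_status_t, DEMO_OK, DEMO_FAILED, struct send_tracker, on_send_complete(), and should_abort() are all hypothetical stand-ins. A real client would give the callback the signature GM expects for send completion callbacks and would compare the status against GM_SUCCESS.

```c
/* Hypothetical stand-in for GM's status type. A real program would use
   the definitions from gm.h and compare against GM_SUCCESS. */
typedef enum { DEMO_OK = 0, DEMO_FAILED = 1 } demo_status_t;

/* Application bookkeeping, passed to the callback as its context. */
struct send_tracker {
    unsigned completed;       /* sends that completed successfully */
    unsigned failed;          /* sends that completed with any other status */
    demo_status_t first_err;  /* first non-success status seen, if any */
};

/* Completion callback sketch: the status argument is always examined. */
static void on_send_complete(void *context, demo_status_t status)
{
    struct send_tracker *t = context;

    if (status == DEMO_OK) {
        t->completed++;
        return;
    }
    if (t->failed == 0)
        t->first_err = status;  /* remember the first failure for reporting */
    t->failed++;
}

/* A simple program can treat any failure as fatal instead of hanging. */
static int should_abort(const struct send_tracker *t)
{
    return t->failed > 0;
}
```

A simple client might test should_abort() in its main loop and exit with a diagnostic, while a high availability client would instead go on to one of the fault response functions.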

The send completion status codes are as follows:

GM_SUCCESS: The send succeeded. This status code does not indicate an error.
GM_SEND_TIMED_OUT: The target port is open and responsive and the message is of an acceptable size, but the receiver failed to provide a matching receive buffer within the timeout period. This error can be caused by the receiver neglecting its responsibility to provide receive buffers in a timely fashion or crashing. It can also be caused by severe congestion at the receiving node where many senders are contending for the same receive buffers on the target port for an extended period. This error indicates a programming error in the client software.
GM_SEND_REJECTED: The receiver indicated (in a call to gm_set_acceptable_sizes()) that the size of the message was unacceptable. This error indicates a programming error in the client software.
GM_SEND_TARGET_PORT_CLOSED: The message cannot be delivered because the destination port has been closed.
GM_SEND_TARGET_NODE_UNREACHABLE: The target node could not be reached over the Myrinet. This error can be caused by the network becoming disconnected for too long, the remote node being powered off, or by network links being rearranged when the Myrinet mapper is not running.
GM_SEND_DROPPED: The send was dropped at the client's request. (The client called gm_drop_sends().) This status code does not indicate an error.
GM_SEND_PORT_CLOSED: Clients should never see this internal error code.
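The distinctions drawn in the list above can be sketched as two small classification helpers. This is only an illustration: the ST_* enumerators are local stand-ins that mirror the status names in the list, their numeric values are arbitrary, and the helper names are hypothetical. Real code would switch on the status values defined in gm.h.

```c
/* Local stand-ins mirroring the status names listed above.
   The numeric values are arbitrary and are NOT those used by GM. */
enum st_code {
    ST_SUCCESS,
    ST_SEND_TIMED_OUT,
    ST_SEND_REJECTED,
    ST_SEND_TARGET_PORT_CLOSED,
    ST_SEND_TARGET_NODE_UNREACHABLE,
    ST_SEND_DROPPED,
    ST_SEND_PORT_CLOSED
};

/* Returns 1 if the status indicates an error, 0 otherwise.
   Success and client-requested drops are not errors. */
static int st_is_error(enum st_code s)
{
    switch (s) {
    case ST_SUCCESS:
    case ST_SEND_DROPPED:
        return 0;
    default:
        return 1;
    }
}

/* Returns 1 for the statuses the list describes as indicating a
   programming error in the client software. */
static int st_is_client_bug(enum st_code s)
{
    return s == ST_SEND_TIMED_OUT || s == ST_SEND_REJECTED;
}
```

Separating "is this an error" from "is this a client bug" lets a program log programming errors loudly while reserving recovery logic for the network faults.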

When the send completion status code indicates an error, a sophisticated client program may respond by calling gm_resume_sending() or gm_drop_sends(). Calling gm_resume_sending() causes GM to simply reenable delivery of subsequent messages over the connection, including those that have already been enqueued. This would be the typical response of a distributed database that assumes the underlying network is unreliable and layers its own reliability protocol over GM. Calling gm_drop_sends() causes GM to drop all enqueued sends over the disabled connection, return them to the client with status GM_SEND_DROPPED, and reenable the connection. This would be the typical response of a program that wishes to reorder subsequent communication over the connection in response to the error.
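The difference between the two responses can be illustrated with a toy model of one disabled connection. This is not GM code and calls no GM functions: struct model_connection, model_resume(), and model_drop() are hypothetical names that only mimic the behavior described above.

```c
/* Toy model of one prioritized connection after a send error. */
struct model_connection {
    int enabled;   /* 0 while delivery is disabled, 1 once reenabled */
    int pending;   /* sends still enqueued behind the failed one */
    int dropped;   /* sends handed back to the client as dropped */
};

/* Mimics the described effect of gm_resume_sending():
   reenable delivery and leave the enqueued sends in place. */
static void model_resume(struct model_connection *c)
{
    c->enabled = 1;
}

/* Mimics the described effect of gm_drop_sends():
   return every enqueued send as dropped, then reenable. */
static void model_drop(struct model_connection *c)
{
    c->dropped += c->pending;
    c->pending = 0;
    c->enabled = 1;
}
```

In this model, a connection with three enqueued sends still has three pending after model_resume(), whereas after model_drop() it has none pending and three reported as dropped. In both cases the connection ends up reenabled.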

Note that each of the fault response functions (gm_drop_sends() and gm_resume_sending()) requires a send token. This send token is implicitly returned to the caller when the callback function passed to gm_drop_sends() or gm_resume_sending() is called by GM.
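The token rule above can be sketched as simple bookkeeping. The names struct token_pool, begin_fault_response(), and fault_response_done() are hypothetical and exist only for this illustration; how a real client tracks its send tokens is up to the client.

```c
/* Hypothetical client-side count of available send tokens. */
struct token_pool {
    int available;
};

/* Call before gm_drop_sends() or gm_resume_sending(): each call
   requires one send token. Returns 1 if a token was taken, or 0 if
   none is available and the fault response must be deferred. */
static int begin_fault_response(struct token_pool *p)
{
    if (p->available <= 0)
        return 0;
    p->available--;
    return 1;
}

/* Body of the callback passed to the fault response function:
   the token is implicitly returned when GM calls it. */
static void fault_response_done(void *context)
{
    struct token_pool *p = context;
    p->available++;
}
```

A client that may need to respond to faults might keep at least one token in reserve so that begin_fault_response() cannot fail at the moment recovery is needed.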

This document was generated by Glenn Brown on October 18, 2001 using texi2html