Sigh….

There have been two articles in HPCWire during the last two months on the use of RDMA in HPC, particularly for implementing the MPI specification. The first, A Critique of RDMA, was written by Patrick Geoffray of Myricom. Patrick has worked with the Open MPI team on improving our support for the Myrinet/MX communication interface. The article assumes a good deal of knowledge about HPC, but it’s a good read if you know anything about MPI. The summary is that RDMA interfaces with explicit memory registration are difficult to use when implementing the matching send/receive rules of MPI.

In response, a chief engineer at IBM (which makes InfiniBand cards, an RDMA technology with explicit memory registration) wrote an attempt at a reasonable reply, A Tutorial of the RDMA Model. The article is both sad and insulting to MPI implementors. Its opening paragraph is:

RDMA encompasses more than can be encapsulated by a reference to RDMA Writes and RDMA Reads. The reduction of the RDMA programming model by describing a poor mapping over MPI or for that matter Sockets indicates a limited understanding of the extensive capabilities of the full RDMA Model.

The problem with this statement is that you’ll find few MPI implementors who believe RDMA is a good model for implementing MPI-1 send/receive semantics, especially among those who have tried. And Patrick has maintained MPICH-GM, an implementation of MPI-1 over the Myrinet/GM interface, which is an RDMA interface with explicit memory registration. It is also clear that Mr. Recio is unfamiliar with the MPI standard and its nuances. For example, in response to Patrick’s comments about copying small messages and registration/deregistration usage for large messages, Mr. Recio claims that “long-lived registrations provide the lowest overhead”. This statement is true, but misses Patrick’s point. The following code is perfectly legal in MPI:

char *foo = malloc(16);
/* populate foo with data */
MPI_Send(foo, 16, MPI_CHAR, ....);
MPI_Send(foo, 16, MPI_CHAR, ....);
free(foo);

An MPI implementation over InfiniBand (using OpenIB or mVAPI) has a couple of choices for implementing MPI’s send. The straightforward solution is to pin the buffer, send the data to the remote process, and unpin the buffer. The problem with this is that the registration/deregistration cost will generally be higher than the cost of the send itself. So one option is to leave the buffer registered and hope the user re-uses it. Ok, so now we get to the free() call. What happens when you free() pinned memory differs from OS to OS, but it’s never good [1].
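To make that cost concrete, here is a minimal sketch of the pin / send / unpin approach over the verbs API. The send_pinned() helper, the elided completion wait, and the error handling are illustrative assumptions, not Open MPI’s actual code:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: pin, send, unpin for a single message. */
int send_pinned(struct ibv_pd *pd, struct ibv_qp *qp, void *buf, uint32_t len)
{
    /* Registration is a syscall plus page pinning; for a 16-byte send it
       costs far more than the data transfer itself. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (mr == NULL) return -1;

    struct ibv_sge sge = { .addr = (uintptr_t) buf, .length = len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND };
    struct ibv_send_wr *bad_wr;
    int rc = ibv_post_send(qp, &wr, &bad_wr);

    /* ... wait for the send completion before touching buf again ... */

    ibv_dereg_mr(mr);   /* deregistration: another expensive call */
    return rc;
}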

So what’s an MPI implementor to do? The short message answer starts with a bounce buffer: a pre-registered buffer held internally by the MPI implementation. The data for the send is copied into the bounce buffer and sent from there. If the MPI is really trying to get aggressive about latency, it might use RDMA instead of send/receive for short messages, but the data is still being pushed out of a bounce buffer. On the receive side, I’ve yet to hear of an MPI implementation over RDMA with explicit registration that does anything but receive the short message into yet another bounce buffer. Why? The short message isn’t the only thing being sent. Because these RDMA networks only provide ordered matching for send/receive, an MPI-internal header has to be sent as well. Until that header is examined, it’s impossible to know where the message is supposed to be delivered.
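As a rough sketch of what that eager path looks like (the header layout and the acquire_bounce_buffer() / post_send() helpers are hypothetical, not any particular MPI’s internals):

#include <stdint.h>
#include <string.h>

struct eager_hdr {            /* hypothetical MPI-internal header */
    uint32_t src_rank;        /* sender's rank, used for matching */
    uint32_t tag;             /* MPI tag, used for matching */
    uint32_t len;             /* payload length in bytes */
};

extern char *acquire_bounce_buffer(void);        /* pre-registered slab */
extern void  post_send(void *buf, size_t len);   /* send/recv or RDMA */

void eager_send(const void *user_buf, uint32_t len,
                uint32_t tag, uint32_t my_rank)
{
    char *bounce = acquire_bounce_buffer();

    struct eager_hdr hdr = { .src_rank = my_rank, .tag = tag, .len = len };
    memcpy(bounce, &hdr, sizeof hdr);              /* header first */
    memcpy(bounce + sizeof hdr, user_buf, len);    /* then the payload */

    /* The receiver lands this in its own bounce buffer, reads the header
       to do the MPI matching, and only then copies the payload into the
       matched receive buffer. */
    post_send(bounce, sizeof hdr + len);
}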

Longer messages are a different story, and there are a number of options. For medium-sized messages, a pipeline of copies and sends works well. For large messages (>128K on modern networks), the copy pipeline delivers much lower bandwidth than the network is capable of, so for optimal performance it is better to pin the user buffer and RDMA directly into the user’s receive buffer. This can be done by pipelining the registration / RDMA / deregistration (an algorithm the Open MPI team has worked hard to optimize and has published on), or by leaving the user buffer pinned, which is how you get optimal bandwidth on NetPIPE. Pinning such a large buffer has a high initial cost, so buffer reuse is critical in the “leave pinned” case. A third option, developed by the Open MPI team, is a combination of the two: a registration pipeline is used to optimize the first send, but the buffer is left pinned for later reuse.

While we implement the leave-pinned options, they aren’t the default and have to be explicitly enabled. Why? Because of the free() problem described earlier. We have to track memory manager usage by intercepting free(), munmap(), and friends in order to deregister the memory and update our caches before the memory is given back to the OS. This is error-prone and frequently causes problems with applications that need to do their own memory management, which is not uncommon in HPC. Other MPI implementations deal with it in other ways (such as not allowing malloc()/free() to give memory back to the OS at all), and those implementations are frequently known for crashing on applications that use the memory manager aggressively.
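Here is a rough sketch of the registration pipeline idea: overlap pinning of the next chunk with the RDMA write of the current one, so the registration cost hides behind the transfer. The chunk size and the register_chunk() / rdma_put_chunk() / wait_put() / deregister_chunk() helpers are illustrative assumptions, not the published Open MPI algorithm verbatim:

#include <stddef.h>

#define CHUNK (1u << 20)    /* 1 MiB pipeline stage, illustrative */

extern void *register_chunk(char *addr, size_t len);   /* pin pages */
extern void  deregister_chunk(void *handle);           /* unpin pages */
extern void  rdma_put_chunk(void *handle, char *addr, size_t len);
extern void  wait_put(void);                           /* wait for the RDMA */

void pipelined_put(char *buf, size_t len)
{
    size_t off = 0;
    void *cur = register_chunk(buf, len < CHUNK ? len : CHUNK);

    while (off < len) {
        size_t this_len = (len - off < CHUNK) ? (len - off) : CHUNK;
        size_t next_off = off + this_len;
        void *next = NULL;

        rdma_put_chunk(cur, buf + off, this_len);     /* start this chunk */
        if (next_off < len) {                         /* pin the next chunk */
            size_t next_len = (len - next_off < CHUNK) ? (len - next_off)
                                                       : CHUNK;
            next = register_chunk(buf + next_off, next_len);
        }
        wait_put();                                   /* this chunk is done */
        deregister_chunk(cur);                        /* unpin behind it */

        cur = next;
        off = next_off;
    }
}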

The final point that really annoyed me in Mr. Recio’s article was the comment:

For MPI, both Mvapich and Open MPI have moved beyond N-1 RDMA connections and use dynamic and adaptive mechanisms for managing and restricting RDMA connections to large data transfers and frequently communicating processes.

This is true, in that Open MPI has done all of these things. However, implementing support for Open IB in Open MPI required quite a bit more work than implementing support for MX. A simple lines-of-code count makes the point (the counts include comments, but both code bases are similarly commented):

Device         Lines of code
Open IB BTL             5751
MX BTL                  1780
OB1 PML                 6283
CM PML                  2137
MX MTL                  1260

The PML components (OB1 / CM) both implement the MPI point-to-point semantics. OB1 is designed to drive RDMA devices, implemented as BTLs. The CM PML is designed to drive library-level matching devices (MX, InfiniPath, and Portals), implemented as MTLs. OB1 includes the logic to handle the various pinning modes described above. The Open IB BTL includes short message RDMA, short message send/receive, and true RDMA. The MX MTL includes short and long message send/receive. The CM PML is a thin wrapper around the MTLs, which are very thin wrappers around the device libraries. As you can see, it takes significantly less code to implement an MX BTL than an Open IB BTL. The difference is even more startling when you compare the MX MTL/CM PML (3397 LOC) with the Open IB BTL/OB1 PML (12034 LOC). This isn’t exactly a fair comparison, as OB1 includes support for multi-device striping. On the other hand, the MX library handles those details internally, so perhaps it is a fair comparison.

As an MPI implementor, I dislike RDMA interfaces with explicit memory registration. Quadrics, which can do RDMA without explicit memory registration by linking the NIC with the kernel’s memory manager, offers many of the benefits of RDMA devices without the registration problems, though there are still copies for short messages in many cases. Most importantly, Quadrics is much more expensive than InfiniBand, frequently an unjustifiable cost when building a cluster. Portals offers a combination of RDMA and send/receive that is extremely powerful. Implementing an MPI over Portals is more difficult than over MX, but it is possible to implement interfaces other than MPI on top of it, which is a useful feature. MX and InfiniPath allow a trivial MPI implementation, with excellent latency and bandwidth.

There is one good thing about InfiniBand that Mr. Recio doesn’t mention: it is so hard to implement an MPI over these devices that two groups (Open IB and MVAPICH) have had great success publishing papers about hacks to get decent performance out of the interconnect.

[1] On Linux, the memory will be deregistered and returned to the OS implicitly, but the MPI’s tables of which pages are pinned haven’t been updated. So when you inevitably get that page back from the OS for a new malloc() call and try to send from it, the cache will think the page is already registered and won’t try to register it, leading to the MPI sending from an unregistered page, which frequently results in incorrect data transmission. On OS X, on the other hand, free() will block until all pages in the allocation are deregistered, which means you’ll deadlock.
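To illustrate where that stale cache entry bites, here is a minimal sketch assuming a hypothetical registration cache API (reg_cache_find() / reg_cache_insert()), not any particular MPI’s code:

#include <stddef.h>
#include <infiniband/verbs.h>

extern struct ibv_mr *reg_cache_find(void *buf, size_t len);
extern void reg_cache_insert(void *buf, size_t len, struct ibv_mr *mr);

/* Look up (or create) a registration for a user buffer before sending. */
struct ibv_mr *get_registration(struct ibv_pd *pd, void *buf, size_t len)
{
    struct ibv_mr *mr = reg_cache_find(buf, len);
    if (mr != NULL)
        return mr;  /* DANGER: if buf was free()d, returned to the OS, and
                       later handed back by malloc() at the same address,
                       this cache hit is stale and the pages are no longer
                       pinned -- exactly the Linux failure described above. */

    mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    reg_cache_insert(buf, len, mr);
    return mr;
}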