Tag Archives: mpi

Open MPI and Debian

Branden sent me a link from Planet Debian where one of the Debian developers talks about improving the packaging for Open MPI in Debian. Nice to see people outside the core development community get excited about Open MPI.

Now if only we could get people to spell it Open MPI instead of OpenMPI 😉

Open MPI in Mac OS X!

http://www.apple.com/macosx/leopard/technology/multicore.html
Apple finally announced that Mac OS X 10.5 (Leopard) will include Open MPI as part of its developer tools. The discussions to make this happen started over 18 months ago, and I’ve been working on it for much of that time, so it’s great to finally see it announced. Woo hoo!

Sun’s working on adding some DTrace hooks into Open MPI for tracking things like message arrival. The hooks won’t make it into 1.2, but they should be nice to have once they’re in, especially considering the announced support for DTrace on Leopard.

Happiness on the Open MPI front!

{C++, Objective C, Fortran 77, Fortran} link compatibility with C

A not-infrequent problem we have with Open MPI is users specifying incompatible compiler flags for the different languages used in Open MPI. This frequently happens when a user wants to build 64-bit on a system that defaults to 32-bit and provides the right CFLAGS, but skips one of FFLAGS, FCFLAGS, CXXFLAGS, or OBJCFLAGS. If you’re lucky, you get an amorphous error at configure time claiming that something completely unrelated to the original problem went wrong. If you’re unlucky, nothing bad happens until you try to link ompi_info, which is one of the last things that gets built.

I wrote a macro for Open MPI today that compiles a small C function into a .o file and then tries to link it against a main program in either Fortran (90 or 77) or one of the C-like languages (C++ / Objective C). The result is not pretty, and I’m hoping there’s a better way to do this, but this is what I came up with (below). The only part I’m really not happy with is all the code to deal with the Fortran case. I’m sure there’s a better way than defining a particular m4 variable and then looking at it for the rest of the function. But such is life…

dnl -*- shell-script -*-
dnl
dnl Copyright (c) 2006      Los Alamos National Security, LLC.  All rights
dnl                         reserved. 
dnl $COPYRIGHT$
dnl 
dnl Additional copyrights may follow
dnl 
dnl $HEADER$
dnl

# OMPI_LANG_LINK_WITH_C(language, [action-if-compatible], [action-if-not])
# -------------------------------------------------------------------------
# Try to link a small test program against a C object file to make
# sure the compiler for the given language is compatible with the C
# compiler.
AC_DEFUN([OMPI_LANG_LINK_WITH_C], [
  AS_VAR_PUSHDEF([lang_var], [ompi_cv_c_link_$1])

  AC_CACHE_CHECK([if C and $1 are link compatible],
    lang_var,
    [m4_if([$1], [Fortran], 
       [m4_define([ompi_lang_link_with_c_fortran], 1)],
       [m4_if([$1], [Fortran 77],
          [m4_define([ompi_lang_link_with_c_fortran], 1)],
          [m4_define([ompi_lang_link_with_c_fortran], 0)])])
     m4_if(ompi_lang_link_with_c_fortran, 1,
       [OMPI_F77_MAKE_C_FUNCTION([testfunc_name], [testfunc])],
       [testfunc_name="testfunc"])

     # Write out C part
     AC_LANG_PUSH(C)
     rm -f conftest_c.$ac_ext
      cat > conftest_c.$ac_ext << EOF
int $testfunc_name(int a);
int $testfunc_name(int a) { return a; }
EOF

     # Now compile both parts
     OMPI_LOG_COMMAND([$CC -c $CFLAGS $CPPFLAGS conftest_c.$ac_ext],
       [AC_LANG_PUSH($1)
        ompi_lang_link_with_c_libs="$LIBS"
        LIBS="conftest_c.o $LIBS"
        m4_if(ompi_lang_link_with_c_fortran, 1, 
          [AC_LINK_IFELSE([AC_LANG_PROGRAM([], [
       external testfunc
       call testfunc(1)
])],
             [AS_VAR_SET(lang_var, ["yes"])], [AS_VAR_SET(lang_var, ["no"])])],
          [AC_LINK_IFELSE([AC_LANG_PROGRAM([
#if defined(c_plusplus) || defined(__cplusplus)
extern "C" int testfunc(int);
#else
extern int testfunc(int);
#endif
], 
             [return testfunc(0);])],
             [AS_VAR_SET(lang_var, ["yes"])], [AS_VAR_SET(lang_var, ["no"])])])
        LIBS="$ompi_lang_link_with_c_libs"
        AC_LANG_POP($1)],
       [AS_VAR_SET(lang_var, ["no"])])
     rm -f conftest_c.$ac_ext
     AC_LANG_POP(C)])

  AS_IF([test "AS_VAR_GET([lang_var])" = "yes"], [$2], [$3])
  AS_VAR_POPDEF([lang_var])dnl
])

AC and AM version numbers

I ran into a problem with Open MPI’s build system today and came up with a reasonable, but sub-optimal, solution. The basic problem is that we need to support both the Autoconf 2.59/Automake 1.9.6 and Autoconf 2.60+/Automake 1.10+ combinations in a project using Objective C (only for a very small, optional component). AC 2.59/AM 1.9.6 have basically zero support for Objective C, while AC 2.60+/AM 1.10 have full support. I had back-ported a bunch of macros to get the bare minimum Objective C support we needed with AC 2.59, but they all conflicted with AC 2.60, so they needed to be provided only when the older Autoconf was in use. So there were two problems:

  • How to determine whether to provide a compatibility AC_PROG_OBJC or use AC’s macro
  • How to prevent using AC 2.60 with AM 1.9.6

The problem with the first one is that looking at AC_PROG_OBJC seemed to cause all kinds of entertaining things to happen with AC 2.60. So the solution there was to peek at AC’s internal version macro to get the version number, then act accordingly. After figuring out all the macros I needed to provide, this worked reasonably well. Look at config/ompi_objc.m4 to see which ones we ended up needing.

Automake doesn’t seem to provide an internal macro with its version number, so I couldn’t do the obvious thing and check it directly to make sure we never end up with AC 2.60 used alongside AM 1.9.6 (which would give us half support for Objective C and be a major pain). The best I could come up with was to use the version-requirement feature of AM_INIT_AUTOMAKE:

m4_if(m4_version_compare(m4_defn([m4_PACKAGE_VERSION]), [2.60]), -1,
  [AM_INIT_AUTOMAKE([foreign dist-bzip2 subdir-objects no-define])],
  [AM_INIT_AUTOMAKE([foreign dist-bzip2 subdir-objects no-define 1.10])])

It’s really vile, but it does work…

Sigh….

There have been two articles in HPCWire during the last 2 months on the use of RDMA in HPC, particularly for implementing the MPI specification. The first, A Critique of RDMA, is written by Patrick Geoffray of Myricom. Patrick has worked with the Open MPI team on improving our support for the Myrinet/MX communication interface. The article assumes a good deal of knowledge about HPC, but is a good read if you know anything about MPI. The summary is that RDMA interfaces with explicit memory registration are difficult to use when implementing the matching send/receive rules of MPI.

In response, Mr. Recio, a chief engineer at IBM (which makes InfiniBand cards, an RDMA interconnect with explicit memory registration), wrote an attempt at a reasonable reply, A Tutorial of the RDMA Model. The article is both sad and insulting to MPI implementors. Its opening paragraph is:

RDMA encompasses more than can be encapsulated by a reference to RDMA Writes and RDMA Reads. The reduction of the RDMA programming model by describing a poor mapping over MPI or for that matter Sockets indicates a limited understanding of the extensive capabilities of the full RDMA Model.

The problem with this statement is that you’ll find few MPI implementors who believe RDMA is a good model for implementing MPI-1 send/receive semantics, especially among those who have tried. And Patrick has maintained MPICH-GM, an implementation of MPI-1 over the Myrinet/GM interface, which is an RDMA interface with explicit memory registration. It is also clear that Mr. Recio is unfamiliar with the MPI standard and its nuances. For example, in response to Patrick’s comments about copying small messages and registration/deregistration usage for large messages, Mr. Recio claims that “long-lived registrations provide the lowest overhead”. This statement is true, but it misses Patrick’s point. The following code is perfectly legal in MPI:

char *foo = malloc(16);
[populate foo with data]
MPI_Send(foo, 16, MPI_CHAR, ....);
MPI_Send(foo, 16, MPI_CHAR, ....);
free(foo);

An MPI implementation over InfiniBand (using OpenIB or mVAPI) has a couple of choices for implementing MPI send. The straightforward solution is to pin the buffer, send the data to the remote process, and unpin the buffer. The problem is that the registration/deregistration cost will generally be higher than the cost of the send itself. So one option would be to leave the buffer registered and hope the user reuses the buffer. Ok, so now we get to the free() call. The results of free()ing pinned memory differ from OS to OS, but they’re never good [1].
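To make the trade-off concrete, here is a rough sketch of the two send paths. The pin/unpin, send, and registration-cache helpers are hypothetical stand-ins for the real interconnect registration calls, not Open MPI code:

#include <stddef.h>

/* Hypothetical types and helpers standing in for the interconnect's
 * registration calls and an MPI-internal registration cache. */
typedef struct registration registration_t;
registration_t *pin_memory(void *buf, size_t len);
void unpin_memory(registration_t *reg);
int rdma_send(int peer, void *buf, size_t len, registration_t *reg);
registration_t *reg_cache_find(void *buf, size_t len);
void reg_cache_insert(void *buf, size_t len, registration_t *reg);

/* Option 1: pin, send, unpin.  Always correct, but the pin/unpin cost
 * generally dwarfs the cost of actually sending a small message. */
static int send_pin_unpin(void *buf, size_t len, int peer)
{
    registration_t *reg = pin_memory(buf, len);    /* expensive */
    int rc = rdma_send(peer, buf, len, reg);
    unpin_memory(reg);                             /* expensive again */
    return rc;
}

/* Option 2: leave the buffer pinned and cache the registration, hoping
 * the application reuses the buffer.  Fast on reuse, but now the MPI
 * has to find out when the user free()s the memory. */
static int send_leave_pinned(void *buf, size_t len, int peer)
{
    registration_t *reg = reg_cache_find(buf, len);
    if (reg == NULL) {
        reg = pin_memory(buf, len);
        reg_cache_insert(buf, len, reg);           /* never unpinned here */
    }
    return rdma_send(peer, buf, len, reg);
}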

So what’s an MPI implementor to do? The short message answer starts with a bounce buffer, a pre-registered buffer held internally by the MPI implementation. The data for the send is copied into the bounce buffer, from which it is then sent. If the MPI is really trying to get aggressive about latency, it might use RDMA instead of send/receive for the short messages, but the data is still being pushed out of a bounce buffer. On the receiver side, I’ve yet to hear of an MPI implementation over RDMA with explicit registration that does anything but receive the short message into yet another bounce buffer. Why? The short message isn’t the only thing being sent. Because these RDMA networks only provide ordered matching for send/receive, an MPI-internal header has to be sent as well, and until that header is analyzed, it’s impossible to know where the message is supposed to be delivered.
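Something like the following sketch, with hypothetical helper names rather than the actual Open MPI internals: the payload is copied in right behind a small matching header, and both leave from memory that was registered once at startup.

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical MPI-internal match header; the receiver needs at least
 * this much information to run the MPI matching rules before it knows
 * where the payload ultimately belongs. */
typedef struct {
    int32_t src_rank;
    int32_t tag;
    int32_t context_id;   /* communicator */
    int32_t length;
} match_hdr_t;

/* Hypothetical pre-registered bounce buffer pool and send primitive. */
void *bounce_buffer_get(size_t len);
int   rdma_send_preregistered(int peer, void *buf, size_t len);

static int send_short(const void *user_buf, int32_t len, int peer,
                      int32_t my_rank, int32_t tag, int32_t context_id)
{
    match_hdr_t *hdr = bounce_buffer_get(sizeof(*hdr) + (size_t)len);

    hdr->src_rank   = my_rank;
    hdr->tag        = tag;
    hdr->context_id = context_id;
    hdr->length     = len;
    memcpy(hdr + 1, user_buf, (size_t)len);   /* the extra copy */

    /* No pin/unpin on the critical path; the bounce buffer was
     * registered when the MPI library started up. */
    return rdma_send_preregistered(peer, hdr, sizeof(*hdr) + (size_t)len);
}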

Longer messages are a different story, and there are a number of options. For medium-sized messages, a pipeline of copies and sends works well. For large messages (>128K on modern networks), the copy pipeline protocol delivers much lower bandwidth than the network is capable of, so for optimal performance it is better to pin the user buffer and RDMA directly into the user’s receive buffer. This can be done by pipelining the registration / RDMA / deregistration (an algorithm the Open MPI team has worked hard to optimize and has published on), or by leaving the user buffer pinned, which is how you get optimal bandwidth on NetPIPE. Pinning such a large buffer has a high initial cost, so buffer reuse is critical in this “leave pinned” case. A third option, developed by the Open MPI team, is a combination of the two: a registration pipeline optimizes the speed of the first send, but the buffer is left pinned for later reuse.

While we implement the leave-pinned options, they aren’t the default and have to be explicitly enabled. Why? Because of the free() problem described earlier. We have to track memory manager usage by intercepting free(), munmap(), and friends in order to deregister the memory and update our caches before giving the memory back to the OS. This is error prone and frequently causes problems with applications that need to do their own memory management, which is not uncommon in HPC. Other MPI implementations deal with it in other ways (like never allowing malloc/free to give memory back to the OS), and those implementations are frequently known for crashing on applications that make aggressive use of the memory manager.
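Going back to the large-message protocol for a moment, here is a rough sketch of the chunked registration pipeline. The helper names are hypothetical; the real OB1 code overlaps the stages and carries far more bookkeeping.

#include <stddef.h>

/* Hypothetical stand-ins for the interconnect's registration and RDMA calls. */
typedef struct registration registration_t;
registration_t *pin_memory(void *buf, size_t len);
void unpin_memory(registration_t *reg);
int rdma_put(int peer, void *buf, size_t len, registration_t *reg,
             size_t remote_offset);

#define PIPELINE_CHUNK (512 * 1024)

/* Register, put, and deregister one chunk at a time so the first bytes
 * start moving long before the whole buffer is pinned.  (The real
 * pipeline overlaps these stages; with "leave pinned" enabled, the
 * deregistration is skipped and the registration cached instead.) */
static int rdma_large_pipelined(void *buf, size_t len, int peer)
{
    size_t off = 0;

    while (off < len) {
        size_t chunk = len - off;
        if (chunk > PIPELINE_CHUNK)
            chunk = PIPELINE_CHUNK;

        registration_t *reg = pin_memory((char *)buf + off, chunk);
        int rc = rdma_put(peer, (char *)buf + off, chunk, reg, off);
        unpin_memory(reg);
        if (rc != 0)
            return rc;

        off += chunk;
    }
    return 0;
}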

The final point that really annoyed me in Mr. Recio’s article was the comment:

For MPI, both Mvapich and Open MPI have moved beyond N-1 RDMA connections and use dynamic and adaptive mechanisms for managing and restricting RDMA connections to large data transfers and frequently communicating processes.

This is true, in that Open MPI has done all of these things. However, implementing support for Open IB in Open MPI required quite a bit more work than implementing support for MX. The proof is in a simple lines-of-code count (the counts include comments, but both code bases are similarly commented):

Device        Lines of code
Open IB BTL   5751
MX BTL        1780
OB1 PML       6283
CM PML        2137
MX MTL        1260

The PML components (OB1 / CM) both implement the MPI point-to-point semantics. OB1 is designed to drive RDMA devices, implemented as BTLs. The CM PML is designed to drive library-level matching devices (MX, InfiniPath, and Portals), implemented as MTLs. OB1 includes the logic to handle the various pinning modes described above. The Open IB BTL includes short message RDMA, short message send/receive, and true RDMA. The MX MTL includes short and long message send/receive. The CM PML is a thin wrapper around the MTLs, which are very thin wrappers around the device libraries. As you can see, it takes significantly less code to implement an MX BTL than an Open IB BTL. The difference is even more startling when you compare the MX MTL/CM PML (3397 LOC) and the Open IB BTL/OB1 PML (12034 LOC). This isn’t exactly a fair comparison, as OB1 includes support for multi-device striping. On the other hand, the MX library handles those details internally, so perhaps it is a fair comparison after all.

As an MPI implementor, I dislike RDMA interfaces with explicit memory registration. Quadrics, which can do RDMA without explicit memory registration by linking the NIC into the kernel’s memory manager, offers many of the benefits of RDMA devices without the registration problems, although there are still copies for short messages in many cases. Most importantly, Quadrics is much more expensive than InfiniBand, frequently an unjustifiable cost when building a cluster. Portals offers a good combination of RDMA and send/receive that is extremely powerful; implementing MPI over it is more difficult than over MX, but it is possible to implement interfaces other than MPI, which is a useful feature. MX and InfiniPath make for a nearly trivial MPI implementation, with excellent latency and bandwidth.

There is one good thing about InfiniBand that Mr. Recio doesn’t mention: it is so hard to implement an MPI over these devices that two groups (Open IB and MVAPICH) have had great success publishing papers about hacks to get decent performance out of the interconnect.

[1] On Linux, the memory will be deregistered and returned to the OS implicitly, but the MPI’s tables of which pages are pinned haven’t been updated. So when you inevitably get that page back from the OS for a new malloc() call and try to send from it, the cache will think the page is already registered and not try to register it. The MPI then sends from an unregistered page, which frequently leads to incorrect data transmission. On OS X, on the other hand, free() will block until all pages in the allocation are deregistered, which means you’ll deadlock.
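To give a flavor of what “tracking memory manager usage” means, here is a toy wrapper with hypothetical cache helpers; Open MPI’s real hooks intercept the allocator itself rather than asking applications to call a wrapper:

#include <stdlib.h>

/* Hypothetical helpers: evict (and deregister) any cached registrations
 * covering a range, and report the size of a malloc'd block. */
void reg_cache_evict_range(void *addr, size_t len);
size_t allocation_size(void *addr);

/* Before memory can go back to the OS, any cached registrations covering
 * it must be dropped, or the cache will later lie about a recycled page. */
void mpi_aware_free(void *ptr)
{
    if (ptr != NULL) {
        reg_cache_evict_range(ptr, allocation_size(ptr));
    }
    free(ptr);
}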

Heterogeneous Open MPI

Committed a reasonably large patch to the Open MPI development trunk last night so that the datatype engine does a reasonable first approximation of the “right” thing in heterogeneous environments. Right now, it deals with endian differences and the differences in representation (size and such) of C++ bool and Fortran LOGICAL. Some work still needs to be done, such as dealing with different representations of long double values. The run-time environment and PML code had already been fixed up to be endian-clean. There’s still one issue to fix in the run-time layer before mixing 32-bit and 64-bit code will work properly, however. Hopefully, that will come in the not too distant future.
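As a trivial illustration of the kind of fix-up involved (not the actual datatype engine code): a 4-byte integer arriving from a peer with the opposite byte order has to be swapped before it lands in the user’s buffer.

#include <stdint.h>

/* Swap the bytes of a 32-bit value received from a peer whose
 * endianness differs from ours. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}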

This work won’t be part of Open MPI 1.0.2, but it should make it into some release down the line, most likely 1.1.

Open MPI quickies

Some random quickies on the Open MPI front that may or may not interest you…

  • Jeff has left the building. Well, Jeff left the building a long time ago, but he’s leaving IU really soon now to take a job at Cisco. More information on his blog.
  • The group has gotten a lot bigger lately. New members include:
    • Cisco Systems (who hired Jeff away from us to work on Open MPI)
    • Voltaire
    • Sun, who has a couple blog posts on the move here and here.
  • The one-sided implementation in Open MPI continues to be debugged, but I have high confidence it will be ready to ship as part of Open MPI v1.1. Woo hoo!
  • George has gone on a performance binge lately, which has resulted in much lower latency for all our interconnects and much better bandwidth for our shared memory transport.

We’re done now, right?

Committed implementations of MPI_Win_lock and MPI_Win_unlock yesterday, which I believe were the last two functions from MPI-1 and MPI-2 that Open MPI had not implemented. Now all we have to do is make it bug free, make it go faster, and add a bunch of cool research projects so we can all graduate. Not necessarily in that order.
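For the curious, here is about the smallest passive-target example that exercises the two new calls. It’s plain MPI-2 usage, nothing Open MPI specific; run it with at least two processes.

#include <mpi.h>

/* Each process exposes one int through a window; rank 0 opens a
 * passive-target epoch and puts a value into rank 1's window. */
int main(int argc, char *argv[])
{
    int rank, buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (rank == 0) {
        int value = 42;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);   /* the put is complete at the target here */
    }

    MPI_Win_free(&win);           /* collective over the communicator */
    MPI_Finalize();
    return 0;
}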

Compilers Suck!

[8:46] brbarret@traal:ttyp0 XL% cat xlsucks.cc 
#include <memory>

int
main(int argc, char *argv[])
{
  return 0;
}
[8:46] brbarret@traal:ttyp0 XL% mkdir memory
[8:46] brbarret@traal:ttyp0 XL% xlc++ xlsucks.cc -o xlsucks
[8:46] brbarret@traal:ttyp0 XL% xlc++ -I. xlsucks.cc -o xlsucks
1540-0820 (S) Unable to read the file ./memory. Operation not permitted.
"./memory", line 1.0: 1540-0809 (W) The source file is empty.
[8:46] brbarret@traal:ttyp0 XL%

Yes, it’s opening a directory for reading, thinking it’s a file. Someone forgot the S_IFDIR check when they were stat()ing around looking for the header. This came up because Open MPI has a directory $(top_srcdir)/opal/memory/ and a global set of CFLAGS that includes -I$(top_srcdir)/opal. memory is also the name of a header in the GNU C++ standard library that is included just about everywhere, and instead of finding that header, the compiler was finding our directory. This took me a while to track down…
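For what it’s worth, the check the compiler is missing is a one-liner. A hypothetical sketch of what a header search should do before trying to read a candidate path:

#include <stdio.h>
#include <sys/stat.h>

/* Return 1 if path exists and is a regular file (a plausible header),
 * 0 otherwise; that is, don't try to read directories as source files. */
static int is_usable_header(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        return 0;               /* doesn't exist */
    }
    return S_ISREG(st.st_mode); /* directories, devices, etc. don't count */
}

int main(void)
{
    printf("./memory usable as a header: %d\n", is_usable_header("./memory"));
    return 0;
}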