Category Archives: Tech

Sigh….

There have been two articles in HPCWire during the last 2 months on the use of RDMA in HPC, particularly for implementing the MPI specification. The first, A Critique of RDMA, is written by Patrick Geoffray of Myricom. Patrick has worked with the Open MPI team on improving our support for the Myrinet/MX communication interface. The article assumes a good deal of knowledge about HPC, but is a good read if you know anything about MPI. The summary is that RDMA interfaces with explicit memory registration are difficult to use when implementing the matching send/receive rules of MPI.

In response, a chief engineer at IBM (which makes InfiniBand cards, an RDMA-with-explicit-memory-registration interconnect) wrote an attempt at a reasonable reply, A Tutorial of the RDMA Model. The article is both sad and insulting to MPI implementors. Its opening paragraph is:

RDMA encompasses more than can be encapsulated by a reference to RDMA Writes and RDMA Reads. The reduction of the RDMA programming model by describing a poor mapping over MPI or for that matter Sockets indicates a limited understanding of the extensive capabilities of the full RDMA Model.

The problem with this statement is that you’ll find few MPI implementors who believe RDMA is a good model for implementing MPI-1 send/receive semantics, especially among those who have tried. And Patrick has maintained MPICH-GM, an implementation of MPI-1 over the Myrinet/GM interface, which is an RDMA-with-explicit-memory-registration interface. It is also clear that Mr. Recio is unfamiliar with the MPI standard and its nuances. For example, in response to Patrick’s comments about copying small messages and registration/deregistration usage for large messages, Mr. Recio claims that “long-lived registrations provide the lowest overhead”. This statement is true, but misses Patrick’s point. The following code is perfectly legal in MPI:

char *foo = malloc(16);
/* populate foo with data */
MPI_Send(foo, 16, MPI_CHAR, ....);
MPI_Send(foo, 16, MPI_CHAR, ....);
free(foo);

An MPI implementation over InfiniBand (using OpenIB or mVAPI) has a couple of choices for implementing MPI send. The straightforward solution is to pin the buffer, send the data to the remote process, and unpin the buffer. The problem with this is that the registration/deregistration cost will generally be higher than the cost of the send itself. So one option would be to leave the buffer registered and hope the user re-uses the buffer. OK, so now we get to the free() call. The results of free()ing pinned memory differ from OS to OS, but they’re never good [1].
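
To make the trade-off concrete, here is a minimal sketch of the two choices. Every name below (register_buffer(), deregister_buffer(), cache_lookup(), cache_insert(), net_send_registered()) is a hypothetical stand-in for the device-level registration calls and the MPI’s internal registration cache, not actual Open MPI or verbs code:

#include <stddef.h>

/* Hypothetical stand-ins for the device-level registration calls and the
   MPI's internal registration cache -- none of these are real API names. */
typedef struct registration registration_t;
registration_t *register_buffer(const void *buf, size_t len);
void            deregister_buffer(registration_t *reg);
registration_t *cache_lookup(const void *buf, size_t len);
void            cache_insert(const void *buf, size_t len, registration_t *reg);
int             net_send_registered(registration_t *reg, const void *buf, size_t len);

/* Option 1: pin, send, unpin.  Correct, but the registration and
   deregistration generally cost more than actually sending 16 bytes. */
int send_pin_unpin(const void *buf, size_t len)
{
    registration_t *reg = register_buffer(buf, len);    /* expensive */
    int rc = net_send_registered(reg, buf, len);
    deregister_buffer(reg);                              /* expensive */
    return rc;
}

/* Option 2: leave the buffer pinned and cache the registration, hoping the
   user sends from the same buffer again.  Fast on reuse, but now free() of
   pinned memory becomes the implementation's problem. */
int send_leave_pinned(const void *buf, size_t len)
{
    registration_t *reg = cache_lookup(buf, len);
    if (NULL == reg) {
        reg = register_buffer(buf, len);
        cache_insert(buf, len, reg);
    }
    /* no deregistration: the entry stays in the cache */
    return net_send_registered(reg, buf, len);
}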

So what’s an MPI implementor to do? The short message answer starts with a bounce buffer, a pre-registered buffer held internally by the MPI implementation. The data for the send is copied into the bounce buffer, from which it is then sent. If the MPI is really trying to get aggressive about latency, it might use RDMA instead of send/receive for the short messages, but the data is still being pushed out of a bounce buffer. On the receiver side, I’ve yet to hear of an MPI implementation over RDMA with explicit registration that does anything but receive the short message into yet another bounce buffer. Why? The short message isn’t the only thing being sent. Because there’s only ordered matching for send/receive on these RDMA networks, an MPI-internal header has to be sent as well. Until that header is analyzed, it’s impossible to know where the message is supposed to be delivered.
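
A rough sketch of that short-message (eager) path. The eager_hdr_t layout and the grab_bounce_buffer() / net_send_registered() helpers are made-up names for illustration, not the actual Open MPI internals:

#include <stddef.h>
#include <string.h>

/* Hypothetical MPI-internal header: enough information for the receiver to
   run the MPI matching rules once the message lands in its bounce buffer. */
typedef struct {
    int    tag;        /* MPI tag */
    int    context;    /* communicator id */
    int    src;        /* sending rank */
    size_t len;        /* payload length */
} eager_hdr_t;

char *grab_bounce_buffer(void);                       /* pre-registered memory */
int   net_send_registered(const void *buf, size_t len);

int eager_send(const void *buf, size_t len, int tag, int context, int src)
{
    char *bounce = grab_bounce_buffer();
    eager_hdr_t *hdr = (eager_hdr_t *) bounce;

    hdr->tag = tag;  hdr->context = context;
    hdr->src = src;  hdr->len = len;
    memcpy(bounce + sizeof(*hdr), buf, len);          /* the copy Patrick describes */

    /* Send (or RDMA write) header + payload out of the bounce buffer.  The
       receiver lands it in its own bounce buffer, reads the header, runs
       matching, and only then copies the payload to the user's receive
       buffer. */
    return net_send_registered(bounce, sizeof(*hdr) + len);
}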

Longer messages are a different story. There are a number of options. For medium sized messages, a pipeline of copies and sends works well. For large messages (>128K on modern networks), the copy pipeline protocol results in much lower bandwidth than the network is capable of delivering. For optimal performance, it is better to pin the user buffer and RDMA directly into the user’s receive buffer. This can be done by pipelining the registration/RDMA/deregistration (an algorithm the Open MPI team has worked hard to optimize and has published on), or by leaving the user buffer pinned, which is how you get optimal bandwidth on NetPIPE. Pinning such a large buffer has a high initial cost, so buffer reuse is critical in the “leave pinned” case. A third option, developed by the Open MPI team, is a combination of the two: a registration pipeline is used to optimize the speed of the first send, but the buffer can be left pinned for later reuse. While we implement the leave-pinned options, they aren’t the default and have to be explicitly enabled. Why? Because of the free() problem described earlier. We have to track memory manager usage by intercepting free(), munmap(), and friends in order to deregister the memory and update our caches before giving the memory back to the OS. This is error-prone and frequently causes problems with applications that need to do their own memory management (which is not uncommon in HPC apps). Other MPI implementations deal with it in other ways (like not allowing malloc/free to give memory back to the OS). Those implementations are frequently known for crashing on applications with aggressive use of the memory manager.
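
And a minimal sketch of the registration pipeline for the large-message case: overlap registering the next chunk with the RDMA of the current chunk, so the first send doesn’t pay the entire pin cost up front. The chunk size and the register_buffer() / rdma_write() / wait_rdma() / deregister_buffer() helpers are illustrative assumptions, not the published protocol:

#include <stddef.h>

#define PIPELINE_CHUNK (256 * 1024)            /* arbitrary illustrative size */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical stand-ins for the registration and RDMA primitives. */
typedef struct registration registration_t;
registration_t *register_buffer(const char *buf, size_t len);
void            deregister_buffer(registration_t *reg);
void            rdma_write(registration_t *reg, const char *buf, size_t len);
void            wait_rdma(void);

void rdma_pipeline_send(const char *buf, size_t len)
{
    size_t off = 0;
    registration_t *cur;

    if (0 == len) return;

    cur = register_buffer(buf, MIN(PIPELINE_CHUNK, len));

    while (off < len) {
        size_t cur_len  = MIN(PIPELINE_CHUNK, len - off);
        size_t next_off = off + cur_len;
        registration_t *next = NULL;

        /* start the RDMA of the current chunk ... */
        rdma_write(cur, buf + off, cur_len);

        /* ... and register the next chunk while the RDMA is in flight */
        if (next_off < len)
            next = register_buffer(buf + next_off,
                                   MIN(PIPELINE_CHUNK, len - next_off));

        wait_rdma();                /* current chunk is on the wire */
        deregister_buffer(cur);     /* or cache it for "leave pinned" */

        cur = next;
        off = next_off;
    }
}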

The final point that really annoyed me in Mr. Recio’s article was the comment:

For MPI, both Mvapich and Open MPI have moved beyond N-1 RDMA connections and use dynamic and adaptive mechanisms for managing and restricting RDMA connections to large data transfers and frequently communicating processes.

This is true, in that Open MPI has done all of these things. However, implementing support for Open IB in Open MPI required quite a bit more work than implementing support for MX. A simple LOC count makes the point (the counts include comments, but both code bases are similarly commented):

Device         Lines of code
Open IB BTL    5751
MX BTL         1780
OB1 PML        6283
CM PML         2137
MX MTL         1260

The PML components (OB1 / CM) both implement the MPI point-to-point semantics. OB1 is designed to drive RDMA devices, implemented as BTLs. The CM PML is designed to drive library-level matching devices (MX, InfiniPath, and Portals), implemented as MTLs. OB1 includes logic to handle the various pinning modes described above. The Open IB BTL includes short message RDMA, short message send/receive, and true RDMA. The MX MTL includes short and long message send/receive. The CM PML is a thin wrapper around the MTLs, which are very thin wrappers around the device libraries. As you can see, it takes significantly less code to implement an MX BTL than an Open IB BTL. The difference is even more startling when you compare the MX MTL/CM PML (3397 LOC) and the Open IB BTL/OB1 PML (12034 LOC). This isn’t exactly a fair comparison, as OB1 includes support for multi-device striping. On the other hand, the MX library handles those details internally, so perhaps it is a fair comparison after all.

As an MPI implementor, I dislike RDMA interfaces with explicit memory registration. Quadrics, which can do RDMA without explicit memory registration by linking the NIC with the kernel’s memory manager, offers many of the benefits of RDMA devices without the registration problems. But there are still copies for short messages in many cases. Most importantly, Quadrics is much more expensive than InfiniBand, frequently an unjustifiable cost when building a cluster. Portals offers a good combination of RDMA and send/receive that is extremely powerful. Implementing an MPI over Portals is more difficult than over MX, but it is possible to implement interfaces other than MPI, which is a useful feature. MX and InfiniPath make an MPI implementation trivial, with excellent latency and bandwidth.

There is one good thing about InfiniBand that Mr. Recio doesn’t mention. It is so hard to implement an MPI over these devices that two groups (Open IB and MVAPICH) have had great success publishing papers about hacks to get decent performance out of the interconnect.

[1] On Linux, the memory will be deregistered and returned to the OS implicitly. But the MPI’s tables of which pages are pinned haven’t been updated. So when you inevitably get that page back from the OS for a new malloc() call and try to send from it, the cache will think the page is still registered and won’t try to register it, leading to the MPI sending from an unregistered page, which frequently results in incorrect data transmission. On OS X, on the other hand, free() will block until all pages in the allocation are deregistered, which means you’ll deadlock.
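
For what it’s worth, this is why the leave-pinned modes have to intercept the memory manager. A minimal sketch of the idea, with cache_evict_and_deregister() and real_free left as hypothetical placeholders for whatever hooking mechanism is actually used (linker wrapping, dlsym interposition, a replacement allocator, etc.):

#include <stdlib.h>

/* Hypothetical: flush any cached registrations that cover this allocation
   and deregister them with the NIC. */
void cache_evict_and_deregister(void *ptr);

/* Hypothetical pointer to the real libc free(), obtained however the
   interception mechanism provides it. */
extern void (*real_free)(void *ptr);

/* Interposed free(): the registration cache must be cleaned up before the
   pages can leave the process (for example, via an munmap() inside the
   real free()).  munmap(), realloc(), and friends need the same
   treatment. */
void free(void *ptr)
{
    if (NULL != ptr) {
        cache_evict_and_deregister(ptr);
    }
    real_free(ptr);
}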

OS X Fun with Shell Scripts

Discovered this the other day, and it’s just so cool (and yet so useless). In OS X land, one can make a .app directory structure to create an application bundle, and the actual executable doesn’t have to be a Cocoa / Carbon application. It can actually be a shell script, or a C program, or Perl, or whatever. Double-click goodness and everything. And most importantly to me, it can be added to the per-user login items without popping up a terminal window (which is what happens if you just add a script with a .command extension). So I have a little shell script application bundle that runs at login on the desktops in the lab to make sure that all the scratch directories I want are there on our scratch disks (so that things like Safari’s cache are on local, fast disk instead of global, slow NFS). Woo Apple: every now and then, they do get something right. Credit must go to MacEnterprise.org for the hint.

random acts of geekdom

Backups are good

Since my backup strategy of “throw a tarfile somewhere and pray” didn’t work out so well after the TiBook died, I decided it was time to look into something more, um, proper. I discovered that rdiff-backup could do almost exactly what I wanted (nightly backups of most of my homedir onto my local Linux server, with incrementals), without all the mess that is trying to do incremental backups with tar. And it supports saving resource forks from OS X, even when the backup machine doesn’t (like, say, my Linux server). Woo! I’m using a combination of rdiff-backup for most of my homedir and Unison for directories like Music that I want synchronized between multiple machines (laptop, Linux desktop, and workstation at the lab). Everything happens by cron script, so it should be mostly foolproof. I also made the backup target on the Linux server its own LVM partition so that I can remount it read-only as soon as I know I’m going to need to restore from it, avoiding the rm -rf issues that screwed me this time.

Apartment streams

Should you happen to end up in our apartment, you can now stream the giant repository of mp3s from mori (the local Linux server) via either iTunes or the TiVo. Yay for mt-daapd and JavaHMO. Oh, and the printer is now also advertised by CUPS, so my PowerBook “just finds it”. Woot! Unfortunately, Red Hat has this really lame thing with their printer setup where the Info field is set to some magic string that can’t be changed and doesn’t help identify the printer. Which really sucks, as that’s what OS X shows in its printer selection box. I think I might disable the autodetection crap in RHAS, set up CUPS’s printers.conf on my own, and set the Info field to something useful. I find this weird, since Fedora Core 1 got it right. Sigh.

You keep using that word. I do not think it means what you think it means.

Objective C is weird

I’ve been spending some time working on making Open MPI utilize Apple’s XGrid technology for process startup and the like. In the end, the difficult parts turned out to be the fact that the XGrid framework is not particularly well documented and that I don’t know Objective C. That’s right – the XGrid interface is in Objective C, so I had to write the xgrid process starter component for Open MPI in Objective C. Which has the nice side-effect of us now being able to say that we have a component written in a language other than C (if only just barely – ObjC is just barely not C). Anyway, very interesting stuff, and I got to learn all about the NSRunLoop. Yippie.

Caps Lock Goodness

In Tiger, Apple added the ability to swap the caps lock and control keys from the Keyboard and Mouse preferences pane. There’s a button under Keyboard labeled Modifier Keys – you can set which modifier each key sends. Worked like a champ on my TiBook. As with uControl, however, the light for the caps lock key would still toggle with every press of the physical caps lock key. Not a big deal, but kind of annoying. On Laura’s brand new, shiny AlBook, the light only switches when the key mapped as caps lock is pressed (which doesn’t have to be the physical caps lock key, obviously). Finally, Apple fixed the annoying caps lock key on their PowerBooks.

Useless hack of the day

Taken from the Common TCSH Completions page, with a little bit of hackery to get a reasonable host list:

set ssh_hostlist=( `cut -f1 -d' ' ~/.ssh/known_hosts | cut -f1 -d',' | xargs`)

complete ssh 'p/1/$ssh_hostlist/' 'p/2/c/'
# rcp and scp allow arguments to be references to either local or remote
# files.  It's impossible to complete remote files, but it's useful to assume
# that the remote file structure is similar to the local one.
#
# when you first start typing, it could be any of a username, hostname,
# or filename.  But filename is probably the most useful case, so:
#
# complete arguments as regular filenames, with following exceptions
# if "@" has been typed, complete with a hostname, and append a colon (:)
# if ":" has been typed, complete with a filename relative to home directory
# if ":/" has been typed, complete with a filename relative to root directory
# 
complete scp "c,*:/,F:/," "c,*:,F:$HOME," 'c/*@/$ssh_hostlist/:/'

modules are the best thing since sliced bread

So my shell configuration files finally imploded badly this weekend. I’ve copied around the same set of configuration files for a couple of years now (since starting to work in the LSC at Notre Dame). Over time, they have grown to immense, incomprehensible proportions. Lately, I’ve been working on two different implementations of the exact same interface, MPI (OMPI and LAM/MPI). Both with the same commands, both with the same library name, etc. In order to keep everything straight, I’ve had a couple of aliases that do evil things to hack up my path so that whichever project I was working on at the time was in my PATH, MANPATH, LD_LIBRARY_PATH, etc. It was evil.

So I ripped it all out and started over this weekend, using the modules package as the core of my configuration. Which has the advantage of making life a hell of a lot easier when I have to use a machine without tcsh as my default shell (and such machines do, unfortunately, exist). Anyway, my shell configuration is now much smaller and much more modular (har, har). The code for setting up my Autotools installation is in one place, the code for initializing Fink (when I’m on OS X) is in another, etc. Oh, and switching between LAM and OMPI is trivial.

Anyway, life is now at least 2.74% better, all at the cost of a weekend of productivity…

Somewhere in this building is our talent

Back in B’ton

I have returned to B’ton. It’s a bit hot and a bit humid, which is not something I enjoy. On the other hand, there is green and hills, which are good. So you win some and you lose some. Soon enough I’ll even have a desk I can work at and all that.

Moving Sucks!

So I had a moving company get all my stuff from LA to B’ton. I figured that my stuff would come out better for it and it would lessen the amount of crap I had to deal with. And there was that whole WWDC thing, of course. Well, it turns out that I was completely wrong. Most of my stuff is worse for the wear. The computer desk is completely destroyed, with the glass top just not making the trip. The TV looks like it spent 4 years in a frat house. Oh, and everything showed up a week later than it was supposed to (it took almost a month to get my stuff from LA to B’ton). So I’m not really happy with the moving company right now. Hopefully, they do better on the reimbursement side of things.

WWDC

Went to Apple’s World Wide Developers Conference the last week of June. It was pretty interesting. They announced some cool features of their next operating system that should make life better. Now they just need to ship the damn thing. Well, and fix some of the bugs I (and I’m sure many others) have found from their prerelease. But it’s all good. It’s Apple – of course the final release is going to work.

Hey everybody, Mr. Nude has an opinion

T-minus 21 days

The countdown to final endex at ISI is 21 days. In that time, I have to help bring a cluster online, get some performance numbers for the mesh router code, finish a paper for I/ITSEC, and actually quit working. Nothing much. 🙂 I’m not sure how I feel about leaving. I really like working with the guys at ISI, but I think getting a PhD is the right choice. So off I go, to WWDC thankfully.

No malloc() for you. Only sbrk().

Apparently there are some failures with LAM and its internal ptmalloc code, the Portland Group compilers, and the AMD Opteron platform. Of course, we don’t have one of those and don’t usually test that configuration. Someone lent us some accounts, so that helps. But it’s still a pain. I hope there isn’t something borked in there – that would be a pain in the butt. Something tells me I’m going to have to import ptmalloc2 and see if that makes the problem better. Which would really be a pain, as LAM 7.1 is actually approaching stabilization. Jeff’s getting goodness on gm. Vishal is looking into the remaining IB bug (which seems to be one of those fluttering things that’s hard to find). And I’m doing configuration, build, and documentation bugs. As well as a bunch of little bugfixes that should make everyone’s life easier.

The automobile should be a time-saving device

The Civic has decided to be a little brat lately. It’s had a dead battery twice in the last two weeks, which is a major pain. I’m not sure what the deal is – I think the first failure was because I left a light on in the car. Not sure what the cause of the second failure was, but it can’t be good. Oh well, I’ll have them take a look when I take the car in for service next week. In addition, the stupid windshield cost me $250 to replace after shattering from a rock on 580 a couple of weeks ago. That’s just not cool. Ah well, I suppose the car is just in its terrible twos. Not much I can do about that ;).

Because if it isn’t heavy, it isn’t a computer

I was bored during the last trip to San Diego and spent a good day trolling around on eBay while the event progressed without issue (definitely not complaining about the lack of issues – it makes me look good that I can sit there and do nothing). Anyway, I ended up buying a Sun E250 server. It has two 250MHz processors, 512MB of RAM, and 6 9GB UltraSCSI drives. Because Solaris doesn’t come with a software RAID stack by default, I think it is going to run Linux for a while. Still deciding exactly what to do with it. Definitely going to be a server somewhere, and the RAID setup is going to make it appealing for use as a cvs/svn repository and all that. The silly thing has redundant power supplies and is rack mountable, so it weighs like 100 lbs. Glad I don’t have to worry about getting it back to Bloomington.

On the lighter computing side, I need to find a cheap Apple machine that can run OS X for use at the apartment. I want to try to have a media server hooked up for streaming MP3s and the like. I’m thinking a nice Bluetooth presentation mouse and a VGA-to-TV converter will make for a fun little toy. But this is probably going to take a little while to configure, since I’m going to have to be poor for a little while. I’d rather just buy the TiVo Home Media option or one of the streaming servers, but none of them seem to support Apple’s FairPlay encrypted AAC files (the things you buy from the iTunes Music Store). And I want only one set of playlists, if I can do it.

Bluetooth is good

I decided it was time to graduate from the one-button Apple mouse I’ve been using at home to a real mouse. Having a wire connected to the mouse seemed silly, so I bought a Microsoft Bluetooth wireless mouse. It’s so nice not having the mouse cord getting in the way – almost enough to make me think about buying a wireless keyboard. If I had built-in Bluetooth on my laptop, I most definitely would.