Tag Archives: mpi

Much MPI Goodness

LAM, LAM, LAM

LAM 7.1 escaped from the developers this morning. New in LAM 7.1 is the ability to have dynamic shared objects for SSI components, an InfiniBand RPI, support for the SLURM batch scheduler, and much more. Basically, much goodness all in all. If you use LAM, you should download it right now. Onward with world domination!

Open MPI

A freak meeting of minds over dinner occurred about a year ago between a bunch of independent MPI developers. The result was the decision to combine forces and produce the one MPI to bind them all. After numerous discussions, the name Open MPI was forged, and a web page was created. The web page is, of course, http://www.open-mpi.org/.

Threading Plans

Jeff and I had a mad planning session this morning on the future of LAM with regard to threading. We have both realized that we need some plan in place for how we are going to thread LAM, even if it isn’t implemented for a very, very long time. Part of this is needed right now: we would like to avoid having to redo any current work on the lam daemons later because of some poor design choice now. With that in mind, here is our threading model. Note that this is just for the Trollius layer; the MPI layer really hasn’t been planned out yet. Which is fine, because we think we at least understand what is needed to thread the MPI layer, and can keep ourselves out of trouble there…

We plan on supporting four levels of threading eventually: SINGLE, USER, LAM, and LAM_DAEMON.
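
For reference, the four levels could be spelled as constants that the init call takes. The enum name and numeric values here are made up for illustration; only the level names come from the plan:

```c
/* Proposed threading levels; the enum name and values are illustrative,
 * and only the level names come from the plan below. */
enum lam_thread_level {
    SINGLE,      /* only the initial thread may call the Trollius layer        */
    USER,        /* any thread may call, but the user handles synchronization  */
    LAM,         /* Trollius sends/receives fully MT hot (much later)          */
    LAM_DAEMON   /* daemon-only: per-thread socket and kio, no locking         */
};
```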

SINGLE is exactly what we have right now. Although there can be more than one thread in the user program, only the initial thread can make any calls to the Trollius layer. It is up to the user to ensure that there are no “badness” situations, such as posting a frecvfront() and then posting another frecvfront() before posting the frecvback().

USER implies that the user will take care of any synchronization issues (such as the frecv*() issue mentioned above), but the user IS allowed to make Trollius calls from more than one thread. This model is obviously more powerful than SINGLE, but still requires that the user handle all synchronization, which is sub-optimal. Further, it is quite obvious that a poorly behaving program could deadlock itself, and potentially deadlock the LAM daemon (although this is unlikely, and not known to be possible).
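
To make the USER contract concrete, here is a minimal sketch of what “the user handles synchronization” could look like: the application wraps all of its Trollius traffic in its own lock. The trollius_call() function is a made-up stand-in for any Trollius entry point (frecvfront(), nsend(), and friends); the real prototypes are different.

```c
#include <pthread.h>
#include <stdio.h>

/* Made-up stand-in for any Trollius entry point (frecvfront(),
 * frecvback(), nsend(), ...); the real prototypes are different. */
static int trollius_call(int who)
{
    printf("thread %d is in the Trollius layer\n", who);
    return 0;
}

/* At the USER level the application owns this lock; LAM itself does
 * no locking for you. */
static pthread_mutex_t trollius_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int who = *(int *) arg;

    /* Serialize all Trollius traffic so that, e.g., two frecvfront()
     * postings can never interleave before the matching frecvback(). */
    pthread_mutex_lock(&trollius_lock);
    trollius_call(who);
    pthread_mutex_unlock(&trollius_lock);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int id[2] = { 0, 1 };

    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, &id[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```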

LAM will not be implemented in the first iteration; it is more for planning at this point than anything else. Eventually, the idea is to make all the various Trollius sends and receives fully MT hot. For the sends, this will be fairly straightforward. For the receives, this is actually complicated as shit. I don’t think we will be able to obtain the level of threading that we (both Jeff and I) would like to see without major changes to how the kernel communicates with the user process. And since neither Jeff nor I seem to understand this model, I think there are some major problems there…

LAM_DAEMON is a special case for the lam pseudo-daemons that is not supposed to be used outside of the daemons. It will provide no synchronization on the sends and receives, but will provide a way for each thread to possess its own socket and kio structure. This will allow us to simulate the separate pseudo-daemons in a single process, without the hand threading that was required in the previous incarnation of the lam daemon. We don’t need any synchronization or protection on the sockets for communication because each thread will use its own socket and protect itself that way. We are only using threads to mimic separate processes for marketing reasons, and would like to keep the behavior as close as possible to the way separate processes behave.
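
A rough sketch of that single-process idea, under the assumption that each pseudo-daemon becomes one thread that opens and owns its own socket. The kernel socket path and the thread body here are invented for illustration:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define KERNEL_SOCK_PATH   "/tmp/lam-kernel"   /* made-up path, for illustration */
#define NUM_PSEUDO_DAEMONS 4

/* Each thread plays one pseudo-daemon and owns its own descriptor, so
 * no locking is ever needed on the socket itself. */
static void *pseudo_daemon_main(void *arg)
{
    int id = *(int *) arg;
    struct sockaddr_un addr;
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);

    if (sock < 0) {
        perror("socket");
        return NULL;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, KERNEL_SOCK_PATH, sizeof(addr.sun_path) - 1);
    if (connect(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        perror("connect");
        close(sock);
        return NULL;
    }

    /* ... run this pseudo-daemon's request loop on its private socket ... */
    printf("pseudo-daemon %d is up on fd %d\n", id, sock);
    close(sock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_PSEUDO_DAEMONS];
    int id[NUM_PSEUDO_DAEMONS];

    for (int i = 0; i < NUM_PSEUDO_DAEMONS; ++i) {
        id[i] = i;
        pthread_create(&tid[i], NULL, pseudo_daemon_main, &id[i]);
    }
    for (int i = 0; i < NUM_PSEUDO_DAEMONS; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}
```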

Of course, LAM_DAEMON means that each thread will need its own kio structure. Because of that, kio will begin to behave like errno does in an MT program on Linux… Some pointer magic will hide this change from the user (which is really good because it reduces the number of places where I might have to change the code). Also, the kernel socket descriptor will be moved from a global variable to a member of the global kio structure. In this way, LAM_DAEMON can ensure that each caller uses the correct socket to write its stuff down (or read it ;-).
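
The pointer magic could work much like glibc’s errno trick: a macro that expands to a dereference of a per-thread lookup, so existing code that names kio compiles unchanged. The struct fields and function names below are placeholders, not the real Trollius definitions; this is just one way the per-thread kio could be wired up:

```c
#include <pthread.h>
#include <stdlib.h>

/* Placeholder for the real Trollius kio structure; the actual fields
 * are different.  The kernel socket descriptor moves in here from its
 * old global. */
struct kio_t {
    int ki_sock;   /* this thread's kernel socket descriptor */
    int ki_pid;    /* cached pid (see below)                 */
};

static pthread_key_t  kio_key;
static pthread_once_t kio_once = PTHREAD_ONCE_INIT;

static void kio_key_create(void)
{
    pthread_key_create(&kio_key, free);
}

/* errno-style accessor: returns this thread's private kio, creating it
 * on first use. */
static struct kio_t *kio_location(void)
{
    struct kio_t *k;

    pthread_once(&kio_once, kio_key_create);
    k = pthread_getspecific(kio_key);
    if (k == NULL) {
        k = calloc(1, sizeof(*k));
        pthread_setspecific(kio_key, k);
    }
    return k;
}

/* Existing code that names "kio" keeps compiling unchanged, just like
 * errno on Linux, where glibc does #define errno (*__errno_location()). */
#define kio (*kio_location())

int main(void)
{
    kio.ki_sock = 42;   /* each thread sees and sets only its own copy */
    return kio.ki_sock == 42 ? 0 : 1;
}
```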

Other changes… We are going to look into storing the pid in the kio structure rather than calling getpid() everywhere. This will let us hide the pid on systems like Linux where each thread has its own pid. While we are on the subject of pids… I also need to make sure that there is no use of getpid() in the lam daemons that is going to break with the new model for the lam daemons. If so, I might have to figure out a quick way around the problem, perhaps something in the new system/user flag we are going to add to the nmsg structure. We will, of course, need to invalidate the kio structure any time the user calls fork(), as the pid of the process has changed and the new process should not enjoy all the rights of the old process until everything is taken care of. That, and we don’t want two processes to have the same event. That would be bad.
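
One low-impact way to handle the fork() case would be pthread_atfork(): register a child handler that wipes the cached pid and marks the state invalid, so the new process cannot reuse the old process’s rights until it re-initializes. The structure and function names below are placeholders for illustration only:

```c
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Minimal placeholder for the cached per-process state; the real kio
 * structure is different. */
static struct {
    int   valid;
    pid_t pid;
    int   sock;
} kio_state = { 1, 0, -1 };

/* Runs in the child after fork(): the new process must not inherit the
 * old process's rights, so wipe the cached state and force a fresh
 * re-init before any further Trollius traffic. */
static void kio_atfork_child(void)
{
    kio_state.valid = 0;
    kio_state.pid   = getpid();   /* the pid has changed              */
    kio_state.sock  = -1;         /* do not share the parent's socket */
}

/* Called once during (thread_)kinit(). */
static void kio_install_fork_handler(void)
{
    pthread_atfork(NULL, NULL, kio_atfork_child);
}

int main(void)
{
    kio_install_fork_handler();
    if (fork() == 0) {
        /* child: kio_atfork_child() has already run at this point */
        _exit(0);
    }
    return 0;
}
```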

kinit() will soon become a synonym for thread_kinit(SINGLE), and thread_kinit() will be the preferred way to do the whole init thing. It will ensure that the proper level of threading is used.
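
Concretely, the shim could be as small as the sketch below. The prototypes, the enum (repeated from the earlier sketch), and the stub body are guesses for illustration, not the real interface:

```c
#include <stdio.h>

/* Threading levels, repeated from the earlier sketch. */
enum lam_thread_level { SINGLE, USER, LAM, LAM_DAEMON };

/* Stub standing in for the real initialization; the actual
 * thread_kinit() would set up locks (or not) for the requested level. */
static int thread_kinit(enum lam_thread_level level)
{
    printf("initializing the Trollius layer at thread level %d\n", level);
    return 0;
}

/* kinit() becomes a thin synonym, so existing code keeps today's
 * single-threaded behavior without any source changes. */
static int kinit(void)
{
    return thread_kinit(SINGLE);
}

int main(void)
{
    return kinit();
}
```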

The decision was made to let the user either use LAM with threads or run separate pseudo-daemons. There will be no support for running LAM without threads in the single-process lam daemon. It is quite clear that the single-process model we have for the lam daemon is far from optimal right now. It is hard to maintain, requires many things to be short-circuited, and, worst of all, it does not behave like the separate pseudo-daemons (look at all the debugging I had to do to make the separate pseudo-daemons work…). So that was out. We definitely realize that people might not want threading support or might not be able to use threads on their machines. Therefore, the easiest route out of the problem was taken: let them use the separate pseudo-daemons. They behave exactly like the threads will in the single-pd model, and they will be maintained because they make debugging a LOT easier (and I’m really going to be playing in the lam daemon for the next couple of months…).

We are also going to replace the died with the resendd, which will open a named Unix socket in the LAM directory and wait for incoming connections. Anything received on the socket will be stored in a local buffer and sent back out using the normal nsend() and friends. This is mainly needed for the kenyad right now, which has a signal handler to catch SIGCHLD but cannot modify the tables it needs to because the tables might be in use. Unfortunately, there is a good chance the kenyad is blocking in an nrecv(), and therefore will not clean up its tables for a very long time. Now, the signal handler will send a message down this special socket, which will cause the resendd to nsend() the message back to the kenyad, which will then fall out of the nrecv() (because it got something sent to it) and clean up its tables, avoiding the need to add anything complicated to the kenyad. Of course, this would all be a lot easier with threads, but we want the user to have the option of compiling LAM without thread support. Anyway, the resendd will only be able to deal with a single nmsg, no payload. This is just to make life a LOT easier… Further, it is up to the sending client to properly fill out the nmsg structure; resendd will simply call nsend() with the structure, nothing more.
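
Here is a minimal sketch of that loop under the single-nmsg, no-payload assumption. The socket path, the layout of struct nmsg, and the nsend() stub are placeholders; in the real daemon they would come from the Trollius headers.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Placeholders: in the real daemon these come from the Trollius
 * headers.  The layout of struct nmsg here is invented for the sketch;
 * the sending client is responsible for filling it out correctly. */
struct nmsg { char raw[128]; };
static int nsend(struct nmsg *msg) { (void) msg; return 0; }

/* Made-up location; the real socket would live in the LAM directory. */
#define RESENDD_SOCK_PATH "/tmp/lam-session/resendd"

int main(void)
{
    struct sockaddr_un addr;
    int listener = socket(AF_UNIX, SOCK_STREAM, 0);

    if (listener < 0)
        return 1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, RESENDD_SOCK_PATH, sizeof(addr.sun_path) - 1);
    unlink(RESENDD_SOCK_PATH);
    bind(listener, (struct sockaddr *) &addr, sizeof(addr));
    listen(listener, 8);

    for (;;) {
        struct nmsg msg;
        int conn = accept(listener, NULL, NULL);

        if (conn < 0)
            continue;
        /* One fixed-size nmsg per connection, no payload.  (A real
         * implementation would loop over short reads.)  Anything that
         * arrives is pushed back out through the normal send path. */
        if (read(conn, &msg, sizeof(msg)) == (ssize_t) sizeof(msg))
            nsend(&msg);
        close(conn);
    }
}
```

The kenyad’s SIGCHLD handler then only needs to connect to this socket, write a single nmsg addressed back to the kenyad, and close the descriptor.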

[Update…] In talking with Jeff, I think I was wrong: the resendd will support a payload packet. This may open us up to screwing ourselves later, as it would be possible to fill the buffer on the socket, and that can only lead to pain and suffering, but who cares? It’s better to have the feature with some limitations than not to have it at all…