Edison is a Cray XC30 with a peak performance of 2.57 petaflops, 133,824 compute cores, 357 terabytes of memory, and 7.56 petabytes of disk. Hosted at NERSC, it is available to many researchers and practitioners in the US.
Installing recent versions of Open MPI (development, unstable or stable) on this machine is easy using the provided platform file. Unfortunately, the same cannot be said of the current ULFM version. It is based on an older unstable version of Open MPI (1.7) and did not backport all the exciting new features from the Open MPI development branch. As a result, getting it to work on Edison is a little challenging, but we all like a little challenge.
Preparing for installation
This step is unfortunately required, as I couldn't find any other way to update the m4 files of our ancient ULFM installation without dedicating too much time to such a hopeless task (especially as the new ULFM 2.0 is almost stable). The issue is triggered by a misconfiguration of autoconf 2.69 on Edison, which breaks the entire autotools chain. It is easily worked around by downloading the Mercurial version of ULFM on another computer and running autogen.sh, then a quick configure with no arguments, followed by “make dist”. This generates two archives (openmpi-1.7.1ULFM.tar.bz2 and openmpi-1.7.1ULFM.tar.gz). Pick the one matching your preferred compression algorithm and copy it to Edison.
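As a rough sketch, the preparation can be wrapped in two small shell functions. The checkout location (ULFM_SRC) and the Edison login host (EDISON_LOGIN) are placeholder assumptions; only the command sequence and the archive names come from the description above.

```shell
#!/bin/sh
# Run this on a machine with a working autotools installation, NOT on Edison.
# ULFM_SRC and EDISON_LOGIN are assumed placeholder values; adjust them.
ULFM_SRC="${ULFM_SRC:-$HOME/ulfm}"
EDISON_LOGIN="${EDISON_LOGIN:-edison.nersc.gov}"

# Regenerate the build system and create the two distribution archives.
# The body runs in a subshell, so the caller's working directory is unchanged.
make_ulfm_dist() (
    cd "$ULFM_SRC" &&
    ./autogen.sh &&
    ./configure &&
    make dist
)

# Copy the bzip2 archive to your home directory on Edison.
copy_to_edison() {
    scp "$ULFM_SRC/openmpi-1.7.1ULFM.tar.bz2" "$EDISON_LOGIN:"
}
```

Calling make_ulfm_dist followed by copy_to_edison should leave the archive in your Edison home directory, assuming the checkout builds cleanly on the helper machine.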
Compiling from source
With the archive on Edison, you can now proceed to the last step: configuring and compiling ULFM there. Let's assume we want to build an optimized version, integrated with Edison's resource management software (Slurm), using uGNI, without debugging information. One of the interesting features ULFM inherited from Open MPI is support for platform files. Here is the platform file for Edison. It will eventually make its way into the official ULFM release; meanwhile you can copy and paste it directly from here.
enable_visibility=no
enable_static=no
enable_shared=yes
with_threads=no
enable_pretty_print_stacktrace=no
enable_dlopen=no
with_portals_config=redstorm
enable_mca_no_build=coll-hierarch,pml-dr,pml-v
enable_contrib_no_build=libnbc,vt
with_rte_support=yes
enable_heterogeneous=no
enable_pty_support=no
enable_mem_debug=no
enable_mem_profile=no
enable_binaries=yes
enable_script_wrapper_compilers=yes
ompi_cv_c_word_size_align=no
enable_mpi_ext=ftmpi
with_ft=mpi
with_openib=no
with_devel_headers=yes
with_alps=no
with_slurm=yes
with_xpmem=/opt/cray/xpmem/default
with_pmi=/opt/cray/pmi/default
with_cray_pmi2_ext=yes
with_ugni=/opt/cray/ugni/default
with_ugni_includedir=/opt/cray/gni-headers/default/include
with_io_romio_flags=--with-file-system=ufs+nfs
with_memory_manager=ptmalloc2
with_valgrind=no
With this platform file saved as platform.edison, you can now configure ULFM using “./configure --with-platform=platform.edison …”. There are many other possible configuration parameters, and you are strongly encouraged to read the output of “./configure --help” to gain more insight into the flexible configuration system. Meanwhile, here is what I usually add on top of the platform file: “--prefix=… --enable-mpirun-prefix-by-default” (replace the dots with your preferred installation directory). Keep in mind that the bin subdirectory of the installation directory must be in your $PATH to have direct access to the wrapper compilers and to mpiexec.
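Put together, the build on Edison might look like the sketch below. The three variables are placeholders of my own (installation prefix, unpacked source directory, platform file location); the configure options are the ones quoted above, and the name of the unpacked directory is an assumption based on the archive name.

```shell
#!/bin/sh
# Placeholders (assumptions): adjust all three to your own setup.
ULFM_PREFIX="${ULFM_PREFIX:-$HOME/opt/ulfm}"            # installation directory
ULFM_BUILD="${ULFM_BUILD:-$HOME/openmpi-1.7.1ULFM}"     # unpacked archive
PLATFORM_FILE="${PLATFORM_FILE:-$HOME/platform.edison}" # platform file shown above

# Configure with the platform file, then build and install.
# The body runs in a subshell, so the caller's working directory is unchanged.
build_ulfm() (
    cd "$ULFM_BUILD" &&
    ./configure --with-platform="$PLATFORM_FILE" \
                --prefix="$ULFM_PREFIX" \
                --enable-mpirun-prefix-by-default &&
    make &&
    make install
)

# Make the wrapper compilers and mpiexec reachable from the shell.
PATH="$ULFM_PREFIX/bin:$PATH"
export PATH
```

Unpack the archive first (for the bzip2 variant, “tar xjf openmpi-1.7.1ULFM.tar.bz2”), then call build_ulfm. Passing the platform file as an absolute path avoids depending on the directory configure is run from.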
Preparing for execution
In general we suggest using a dedicated MCA parameter file (the -am option of mpiexec) to provide the parameters for your ULFM runs, so they do not collide with your normal Open MPI runs. This works as long as you use mpiexec to start your application. If you want to rely on srun instead, you will have to copy the contents of your AMCA file directly into your main ${HOME}/.openmpi/mca-params.conf file (and pay attention when you execute normal Open MPI runs, since they will pick up these parameters too).
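The two launch paths could be handled along these lines. AMCA_FILE is a placeholder for wherever you keep your ULFM parameter file, and ./my_app is a hypothetical application name; the parameters themselves are not shown here.

```shell
#!/bin/sh
# AMCA_FILE is a placeholder (assumption) for your ULFM parameter file.
AMCA_FILE="${AMCA_FILE:-$HOME/ulfm.amca}"

# Launching through mpiexec: pass the file explicitly, for example
#   mpiexec -am "$AMCA_FILE" -np 4 ./my_app     (./my_app is hypothetical)

# Launching through srun: append the AMCA contents to the per-user defaults.
# Beware: every Open MPI run from this account will then see these parameters.
append_amca_to_defaults() {
    mkdir -p "$HOME/.openmpi" &&
    cat "$AMCA_FILE" >> "$HOME/.openmpi/mca-params.conf"
}
```

Appending rather than overwriting keeps any parameters already in mca-params.conf. Remember to remove the ULFM entries again before going back to regular Open MPI runs.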
Running ULFM applications in an interactive job
Congratulations, the hardest steps are now done: you should have a fully compiled, ready-to-run version of ULFM, with its binaries in your $PATH. Allocating an interactive job on Edison is well described in the online documentation. There is one trick, however: you need to add an option that prevents Slurm from destroying your job when one of the processes disappears. This option is --no-kill.
Thus, allocating an interactive job on 4 nodes for a short duration can be achieved with “salloc -p debug -N 4 --no-kill”.
Running an application then becomes easy, as indicated in the ULFM setup steps.
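If you allocate such jobs often, a tiny wrapper saves retyping the options. This is only a convenience sketch: the function name ulfm_salloc is my own, and the partition and node count are the ones from the command above.

```shell
#!/bin/sh
# Request an interactive allocation that Slurm keeps alive when a process
# is lost. Any extra arguments are passed through to salloc unchanged.
ulfm_salloc() {
    salloc -p debug -N 4 --no-kill "$@"
}
```

For example, “ulfm_salloc -t 00:10:00” adds an explicit ten-minute time limit. Inside the shell that salloc opens, start your application with mpiexec as usual.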
Survival readiness
Your code now has a sane execution environment that is able to survive faults, and that can help your application achieve the same goal.