MPI

Open MPI is an open-source implementation of MPI (the Message Passing Interface), the industry-standard specification for writing message-passing programs. Message passing is a programming model that gives the programmer explicit control over interprocess communication.


Current Version Table

Module                  Best Partition(s)                           Invocation
openmpi/openmpi-3.1.3   z10ph-hpc3, r630-hpc3, hpc4, hpc4rc, hpc5   srun

Old Version Table

Module                             Best Partition(s)                     Invocation
openmpi/openmpi-3.1.2-qlogic*      z10ph-hpc3, r630-hpc3, hpc4, hpc4rc   srun
openmpi/openmpi-3.1.2-mellanox**   hpc5                                  srun
openmpi/openmpi-3.1.0              z10ph-hpc3, r630-hpc3, hpc4, hpc4rc   srun
openmpi/openmpi-2.1.3              z10ph-hpc3, r630-hpc3, hpc4, hpc4rc   srun
openmpi/openmpi-2.1.2 and lower    z10ph-hpc3, r630-hpc3, hpc4, hpc4rc   mpirun
  • * May be removed in the future
  • ** Replaced by openmpi/openmpi-3.1.3

Device for Partition Table

Partition   Device
*hpc5*      mlx5_3:1
*hpc4*      qib0:1
*hpc3*      qib0:1

NOTE

OpenMPI should be robust enough to fall back to using the 'normal' network interfaces if all else fails, but there will be a performance hit.

NOTE

MPI code compiled with OpenMPI versions 1.10.2 through 2.1.1 on the Lewis cluster should not be run with srun; instead, run it with mpirun inside of an SBATCH script, as in the sketch below. Use of srun may result in errors.
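
For these older builds, a minimal SBATCH script might look like the following sketch (the module name, partition, and resource values are placeholders, not a tested recipe):

#!/bin/bash
#SBATCH --partition hpc4
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time 0-00:05:00

# Older OpenMPI modules (2.1.2 and lower in the table above) are launched
# with mpirun rather than srun; mpirun determines the node and task layout
# from the Slurm allocation.
module load openmpi/openmpi-2.1.2
mpirun ./mpi_hello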

NOTE

MPI code compiled with OpenMPI version 2.1.3 or greater on the Lewis cluster should ONLY be run with srun.

OpenMPI 3.1.3 on Lewis

Please note that for OpenMPI 3.1.3, loading the module outside of a Slurm allocation will result in the following warning message:

===== WARNING
Loading this module outside of a slurm allocation requires you to set
the environment variable 'OMPI_MCA_btl_openib_if_include' to the correct
value for the partition you are submitting to.
See http://docs.rnet.missouri.edu/Software/mpi for more information.
=====

The openmpi/openmpi-3.1.3 module file attempts to determine which partition you are running your MPI job on in order to set the correct value of OMPI_MCA_btl_openib_if_include. If your workflow includes loading this module outside of a Slurm allocation, it is up to you to include logic in your job to set OMPI_MCA_btl_openib_if_include correctly.

To silence this warning, use the module within a Slurm allocation OR set the environment variable OMPI_MCA_btl_openib_if_include to the correct value for the partition you are submitting work to. See Device for Partition Table and Usage for more information.
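
If you do need to load the module outside of an allocation, one possible approach (a sketch only, based on the Device for Partition Table above; the TARGET_PARTITION variable is a placeholder you would set yourself) is to map your target partition to its device before loading the module:

# Sketch: choose the InfiniBand device for the partition you plan to submit to.
# Pairs taken from the Device for Partition Table above; adjust if partitions
# or devices change.
TARGET_PARTITION='hpc5'   # placeholder: the partition you will submit work to

case "$TARGET_PARTITION" in
  *hpc5*)        export OMPI_MCA_btl_openib_if_include='mlx5_3:1' ;;
  *hpc4*|*hpc3*) export OMPI_MCA_btl_openib_if_include='qib0:1' ;;
  *)             echo "No known device for partition: $TARGET_PARTITION" >&2 ;;
esac

module load openmpi/openmpi-3.1.3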

Also, it has come to our attention that some MPI workflows on hpc5 nodes hang and produce no output. If you experience an issue like this, please try setting the following within your workflow:

## This should go before calling your MPI aware code
export OMPI_MCA_btl=openib

Usage on Lewis

In this example, we will compile and run an OpenMPI program using the resources of the Lewis cluster. While MPI bindings exist for many languages, this example is written in C and uses the mpi.h header. On the Lewis cluster, the tasks requested with --ntasks become the MPI processes: a single piece of code runs as many processes, potentially spread across many machines, that exchange data with each other by passing messages. Remember, --nodes defines the total number of allocated nodes, while --ntasks defines the number of tasks started across those nodes. When running OpenMPI code on the Lewis cluster, this means the --ntasks value will equal the number of MPI ranks. If you request more tasks (--ntasks) than the combined number of cores across your --nodes nodes, the job submission will fail.
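
As a concrete illustration (a sketch only; the partition choice and resource values are hypothetical), the request below launches eight MPI ranks across two nodes, and will only be accepted if those two nodes together provide at least eight cores:

#!/bin/bash
# Sketch of the --nodes/--ntasks relationship: eight tasks become eight MPI
# ranks spread across two allocated nodes.
#SBATCH --partition hpc5
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time 0-00:05:00

module load openmpi/openmpi-3.1.3
srun ./mpi_hello   # srun starts one MPI rank per task (eight ranks here)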

As we will be compiling our MPI code via srun on a node in the hpc5 partition, we will set OMPI_MCA_btl_openib_if_include to mlx5_3:1 before loading the openmpi/openmpi-3.1.3 module.

[user@cluster ~]$ export OMPI_MCA_btl_openib_if_include='mlx5_3:1'
[user@cluster ~]$ module load openmpi/openmpi-3.1.3

Now, using srun, we compile our mpi_hello_world.c:

[user@cluster ~]$ srun --partition hpc5 --nodes=1 --ntasks=1 mpicc -o mpi_hello mpi_hello_world.c

Finally, we can submit the example SBATCH script:

[user@cluster ~]$ sbatch ./mpi_sbatch.sh

If all went well, you should get something like the following output in your results file:

Currently Loaded Modulefiles:
  1) xz/xz-5.2.3                        5) libpciaccess/libpciaccess-0.13.5
  2) zlib/zlib-1.2.11                   6) hwloc/hwloc-1.11.9
  3) libxml2/libxml2-2.9.4              7) openmpi/openmpi-3.1.3
  4) numactl/numactl-2.0.11
Hello world from processor lewis4-r640-hpc5-node841, rank 2 out of 4 processors
Hello world from processor lewis4-r640-hpc5-node841, rank 3 out of 4 processors
Hello world from processor lewis4-r640-hpc5-node841, rank 1 out of 4 processors
Hello world from processor lewis4-r640-hpc5-node841, rank 0 out of 4 processors

Contents of Example Files

mpi_sbatch.sh:

#!/bin/bash
#-------------------------------------------------------------------------------
#  SBATCH CONFIG
#-------------------------------------------------------------------------------
## resources
#SBATCH --partition hpc5
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=1G
#SBATCH --time 0-00:05:00
#SBATCH --job-name=example_mpi_job
#SBATCH --output=results-mpi-%j.out
#-------------------------------------------------------------------------------

# Load your modules here:
module load openmpi/openmpi-3.1.3
module list

# Science goes here:
srun ./mpi_hello

mpi_hello_world.c:

// Author: Wes Kendall
// Copyright 2011 www.mpitutorial.com
// This code is provided freely with the tutorials on mpitutorial.com. Feel
// free to modify it for your own use. Any distribution of the code must
// either provide a link to www.mpitutorial.com or keep this header intact.
//
// An intro MPI hello world program that uses MPI_Init, MPI_Comm_size,
// MPI_Comm_rank, MPI_Finalize, and MPI_Get_processor_name.
//
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  // Initialize the MPI environment. The two arguments to MPI_Init are not
  // currently used by MPI implementations, but are there in case future
  // implementations might need the arguments.
  MPI_Init(NULL, NULL);

  // Get the number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Get the rank of the process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Get the name of the processor
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);

  // Print off a hello world message
  printf("Hello world from processor %s, rank %d out of %d processors\n",
         processor_name, world_rank, world_size);

  // Finalize the MPI environment. No more MPI calls can be made after this
  MPI_Finalize();
}