Parallel Hello World
This course page was updated until March 2022 when I left Durham University. For future updates, please visit the new version of the course pages.

Hello, World! #

As with every programming course, the first thing we will do is compile and run a “Hello world” program. Actually we’ll do three. The goal of this is to familiarise you with the module system on Hamilton, as well as how to compile code. So take a look at the quickstart guide if you haven’t already.

A serial version #

Log in to Hamilton/COSMA and load the relevant compiler modules:

Hamilton: intel/xe_2018.2 and gcc/9.3.0
COSMA: intel_comp/2018
For the rest of the description, I’ll just write “Hamilton”, but you should interpret that to mean Hamilton or COSMA depending on which you’re using. Where there is a particular difference, I’ll call it out.

Create a C file serial.c containing the below.

hello/serial.c
#include <stdio.h>

int main(void)
{
  printf("Hello, World!\n");
  return 0;
}

All of these code snippets live in the course repository, so rather than copying or downloading, you should clone the repository and work in the relevant code subdirectory.

We’re working in the code/hello subdirectory for this exercise.
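
For example, after cloning (the URL and directory name here are placeholders; use the repository address given on the course pages):

$ git clone <course-repository-url> parallel-course
$ cd parallel-course/code/hello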

Having done that, you should compile the code with icc

$ icc -o hello-serial serial.c

This creates an executable named hello-serial from the serial.c file.

Run it on the login node

$ ./hello-serial
Hello, World!

Next, as you saw in the quickstart guide, we should actually do our runs on the compute nodes. To do this, we need to create a submission script

hello/serial.slurm
#!/bin/bash

# 1 core
#SBATCH -n 1
#SBATCH --job-name="hello-serial"
#SBATCH -o hello-serial.%J.out
#SBATCH -e hello-serial.%J.err
#SBATCH -t 00:01:00
#SBATCH -p test.q

source /etc/profile.d/modules.sh

module load intel/xe_2018.2
module load gcc/9.3.0

./hello-serial

Since we only have a small job, we have requested the “test” queue. This has a fast turn-around, but does not allow you to run large simulations.

On COSMA you need to select the cosma queue (#SBATCH -p cosma) and you’ll need to select the right account for charging with #SBATCH -A ACCOUNT. This charging account is one that you should have been told about when registering; it is used to determine how much compute time you use.

For many more details, see the COSMA documentation.

Submit your job with

$ sbatch serial.slurm
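
Slurm responds with the ID of the newly submitted job, something like

Submitted batch job 3186773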

After running, this should create two files named hello-serial.SOMENUMBERS.out and hello-serial.SOMENUMBERS.err.

$ ls hello-serial.*.{err,out}
hello-serial.3186773.err  hello-serial.3186773.out

The numbers correspond to the job ID. You can see the contents of these by opening them, or cating them to the screen.

The error file should be empty:

$ cat hello-serial.3186773.err
# No output

The output file should contain the string Hello, World!

$ cat hello-serial.3186773.out
Hello, World!

An OpenMP version #

One of the parallelisation paradigms we will see in this course is shared memory parallelism, for which we use OpenMP. OpenMP is a specification for compiler directives and library routines that can be used to specify parallelism at a reasonably high level for Fortran, C, and C++ programs.

To compile OpenMP programs, we need the same modules as for the serial programs above. Our code now looks a little different.

hello/openmp.c
#include <stdio.h>
#include <omp.h>

int main(void)
{
  int nthread = omp_get_max_threads();
  int thread;
#pragma omp parallel private(thread) shared(nthread)
  {
    thread = omp_get_thread_num();
    printf("Hello, World! I am thread %d of %d\n", thread, nthread);
  }
  return 0;
}

If we try and compile this like we did before, we will get an error

$ icc -o hello-openmp openmp.c
openmp.c(8): warning #3180: unrecognized OpenMP #pragma
  #pragma omp parallel private(thread) shared(nthread)
          ^

/tmp/iccLQGI6H.o: In function `main':
openmp.c:(.text+0x2a): undefined reference to `omp_get_max_threads'
openmp.c:(.text+0x32): undefined reference to `omp_get_thread_num'

First the compiler warns us that it saw a #pragma that it did not recognise. Then the linker complains that there are two undefined functions.

OpenMP support is implemented in most modern compilers, but has to be explicitly requested. To do so we must add the additional flag -qopenmp to our compile command

$ icc -qopenmp -o hello-openmp openmp.c
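
(If you compile with gcc rather than the Intel compiler, the equivalent flag is -fopenmp, for example gcc -fopenmp -o hello-openmp openmp.c.)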

Our submission script also looks a little different

hello/openmp.slurm
#!/bin/bash

# 1 node
#SBATCH --nodes=1
#SBATCH --job-name="hello-openmp"
#SBATCH -o hello-openmp.%J.out
#SBATCH -e hello-openmp.%J.err
#SBATCH -t 00:01:00
#SBATCH -p test.q

source /etc/profile.d/modules.sh

module load intel/xe_2018.2
module load gcc/9.3.0

export OMP_NUM_THREADS=2
./hello-openmp

We select the amount of parallelism by setting the OMP_NUM_THREADS environment variable before running the executable.
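
For a quick interactive check on the login node, you can also set the variable just for a single run, for example

$ OMP_NUM_THREADS=4 ./hello-openmp

Only do this for tiny test programs like this one; real runs belong on the compute nodes.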

After submitting our job with sbatch we again get some output files

$ ls hello-openmp.*.{err,out}
hello-openmp.3186885.err  hello-openmp.3186885.out

We can inspect the contents as before with cat

$ cat hello-openmp.3186885.out
Hello, World! I am thread 0 of 2
Hello, World! I am thread 1 of 2

Note how the individual threads are numbered from zero and live in the interval [0, 2).

Question

Try changing the number of threads. What do you notice about the output?

An MPI version #

The other parallelisation paradigm we will use is for programming distributed memory systems. We will use MPI for this. MPI is a specification for a library-based programming model. The standard specifies Fortran and C/C++ interfaces, and there are wrappers for many popular programming languages including Python and Julia.

To compile MPI programs we additionally need the right MPI module. So, in addition to the other module load commands, execute

$ module load intelmpi/intel/2018.2

on Hamilton, or

$ module load intel_mpi/2018

on COSMA.

Our code again looks different

hello/mpi.c
#include <stdio.h>
#include <mpi.h>

int main(void)
{
  int rank, size, len;
  MPI_Comm comm;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Init(NULL, NULL);
  comm = MPI_COMM_WORLD;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  MPI_Get_processor_name(name, &len);
  printf("Hello, World! I am rank %d of %d. Running on node %s\n",
         rank, size, name);
  MPI_Finalize();
  return 0;
}

Notice how, compared to the OpenMP version, there are no pragmas and the parallelism is not explicitly annotated; there are just calls to MPI library functions.

If we try and compile with icc, we get errors.

$ icc -o hello-mpi mpi.c
mpi.c(2): catastrophic error: cannot open source file "mpi.h"
  #include <mpi.h>
                  ^

compilation aborted for mpi.c (code 4)

To compile this file, we need to tell the compiler about all the MPI-relevant include files and libraries to link. Since this is complicated, MPI library implementors typically ship compiler wrappers that set the right flags. On Hamilton these are named mpicc (for the MPI wrapper around the C compiler), mpicxx (for the wrapper around the C++ compiler), and mpif90 (for the Fortran wrapper). Since we have a C source file, we should use mpicc

$ mpicc -o hello-mpi mpi.c
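
If you are curious what the wrapper actually does, many wrappers (including the Intel MPI and MPICH ones) accept a -show flag, which prints the underlying compiler command, with all the extra include and link flags, without running it (Open MPI’s wrappers use --showme instead):

$ mpicc -show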

Running the executable is now also more complicated: we need to use mpirun to launch it. This takes care of allocating parallel resources and setting up the processes so that they can communicate with one another.

hello/mpi.slurm
#!/bin/bash

# 1 node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --job-name="hello-mpi"
#SBATCH -o hello-mpi.%J.out
#SBATCH -e hello-mpi.%J.err
#SBATCH -t 00:01:00
#SBATCH -p par7.q

source /etc/profile.d/modules.sh

module load intel/xe_2018.2
module load gcc/9.3.0
module load intelmpi/intel/2018.2

mpirun ./hello-mpi

Notice this time how we specified the par7 queue (you can stick with the cosma queue on COSMA, although you need to change the number of tasks to 16). You can get information about all the available queues with sinfo. A summary of free resources in each queue is also available with sfree.

On some systems you need to specify the number of processes you want to use when executing mpirun. However, on Hamilton, the metadata in the scheduler is used to determine the number of processes. Here we request 1 node and 24 tasks per node.

Hence you only need to specify the number of processes explicitly if you want to run with an amount of parallelism different from that requested in your submission script.
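
For example, to launch just four processes inside the allocation, you could instead write

mpirun -n 4 ./hello-mpi

in the submission script.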

Question

Try running on two compute nodes, by changing the --nodes=1 line to --nodes=2. How many total processes do you now have? What do you notice about the node names?