Simple MPI parallelism #
In this exercise we’re going to compute an approximation to the value of π using a simple Monte Carlo method. We do this by noticing that if we throw darts uniformly at random at a square, the fraction that land within the inscribed circle approaches π/4.
Consider a square with side-length $2r$ and an inscribed circle with radius $r$.
The ratio of areas is
$$ \frac{A_\text{circle}}{A_\text{square}} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4}. $$
By symmetry it therefore suffices to sample a single quadrant: if we draw $X$ uniformly at random from the distribution $\mathcal{U}(0, r) \times \mathcal{U}(0, r)$, the probability that $X$ lands inside the circle is
$$ p_\text{in} = \frac{\pi}{4}. $$
We can therefore approximate π by picking $N_\text{total}$ random points and counting the number, $N_\text{circle}$, that fall within the circle:
$$ \pi_\text{numerical} = 4 \frac{N_\text{circle}}{N_\text{total}}. $$
Obtaining the code #
The code for this exercise lives in the code/parallel/pi/ subdirectory of the repository, as pi.py (see the notes on working from the repository).

I provide a simple serial implementation that uses numpy to generate the random numbers.
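For concreteness, a minimal serial estimator along these lines might look like the sketch below. The function name and structure are illustrative and need not match pi.py exactly:

```python
import numpy as np


def estimate_pi(n_total, seed=None):
    """Monte Carlo estimate of pi.

    Samples the unit square and counts the fraction of points that
    fall inside the quarter circle of radius 1, which approaches pi/4.
    """
    rng = np.random.default_rng(seed)
    xy = rng.uniform(0, 1, size=(n_total, 2))
    n_circle = np.count_nonzero((xy**2).sum(axis=1) <= 1)
    return 4 * n_circle / n_total


if __name__ == "__main__":
    print(estimate_pi(10**6))
```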
Exercise

Adapt the code so that it runs for a range of different choices of the number of samples, $N$. Plot the error in the estimated value of $\pi$ as a function of $N$. What relationship do you observe between the accuracy of the approximate result and $N$?

Hint

A double-precision approximation to $\pi$ is available as numpy.pi.
Parallelisation with MPI #
We’re now going to parallelise this computation with MPI.
If you’re running on Hamilton, don’t forget to load the right modules:

```
python/3.6.8 gcc/8.2.0 intelmpi/gcc/2019.6
```
The code already imports mpi4py, but does not distribute the work.
Exercise

Adapt the run function so that the total number of samples is (approximately) evenly distributed between all the ranks in the given communicator.

You’ll now have each process computing a partial answer, so combine them with MPI_Allreduce. One possible approach is sketched below.
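Here is a minimal sketch of what such a parallel run function could look like, assuming the signature run(n_total, comm); the names and structure in pi.py may differ:

```python
from mpi4py import MPI
import numpy as np


def run(n_total, comm=MPI.COMM_WORLD):
    # Split n_total as evenly as possible: the first (n_total % size)
    # ranks each take one extra sample.
    n_local = n_total // comm.size + (comm.rank < n_total % comm.size)
    rng = np.random.default_rng()
    xy = rng.uniform(0, 1, size=(n_local, 2))
    in_circle = np.count_nonzero((xy**2).sum(axis=1) <= 1)
    # Sum the partial counts across ranks; every rank receives the total.
    n_circle = comm.allreduce(in_circle, op=MPI.SUM)
    return 4 * n_circle / n_total
```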
Parallelising the random number generation #
Running this code in parallel presents us with a slight problem. We need to think about how to provide statistically independent random number streams on the different processes.
Fortunately, modern versions of numpy have us covered. Their documentation describes how to obtain random numbers in parallel.
Exercise

Replace the use of the default_rng generator with a Generator that produces a different stream on each process. Remember that comm.rank is unique to each process in the communicator.
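One approach, following the numpy documentation on parallel random number generation, is to spawn independent child seeds from a common SeedSequence and index them by rank (the root seed below is arbitrary):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# Spawn one statistically independent child seed per rank from a shared
# root seed (12345 is arbitrary), then pick ours using comm.rank.
seeds = np.random.SeedSequence(12345).spawn(comm.size)
rng = np.random.default_rng(seeds[comm.rank])
```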
Benchmarking #
We’ll now look briefly at how this code scales by carrying out some simple strong and weak scaling tests.
Exercise

Use MPI.Wtime() to measure the time the run function takes on each process. Use the maximum over all processes for your plots (a timing sketch is shown after this list).

- Produce a strong scaling plot using a total of $N=10^8$ points.
- Produce weak scaling plots using $N=10^4$, $N=10^5$, $N=10^6$, and $N=10^7$ points per process.

What observations do you make about the scaling behaviour?
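A minimal timing sketch, assuming the run(n_total, comm) signature from the earlier sketch:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

start = MPI.Wtime()
pi_numerical = run(10**8, comm)  # run as sketched above; signature assumed
elapsed = MPI.Wtime() - start

# The slowest process determines the parallel runtime, so reduce with MAX.
max_elapsed = comm.reduce(elapsed, op=MPI.MAX, root=0)
if comm.rank == 0:
    print(f"pi ~= {pi_numerical}, max time: {max_elapsed:.4f}s")
```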