2020/21
This course page was updated until March 2022, when I left Durham University. The materials herein are therefore not necessarily still up to date.

Annotated slides 2020/21 edition

Slides as annotated during the live lectures. Recordings of the live sessions are available if you’re appropriately logged in. If you think you should have access but don’t, please get in touch.

  • Session 1, video.

  • Session 2, video.

  • Session 3, and the roofline paper, video part I and part II.

  • Session 4, video. Towards the end, in the exercises, I got a bit confused trying to work out why likwid was reporting load counts different from those we were expecting. We worked this out (thanks Finlay) by the next session: the compiler was inlining some code and producing some extra moves.

  • Session 5, video.

  • Session 6, video. I went over the end of the cache blocking for matrix-matrix multiplication again, and then we looked briefly at the performance we obtained on Hamilton. We then looked at how we can convince our compiler to generate the code we know is right. As mentioned, in my experience, of Intel, GCC, and Clang, the Intel compiler produces the best optimisation reports. Flags for Intel are in the slides. For GCC, we can get some reports with -fopt-info (see this page and search for -fopt-info). For Clang, we can get some reports with -Rpass=vec and -Rpass-missed=vec (see this page).

  • Session 7 (I also annotated one page from the BLIS paper), video.

    The two articles I took examples from for the dimension-lifted transposition approach are Henretty et al. (2011), Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures, and Boyle et al. (2015), Grid: A next generation data parallel C++ QCD library. One example of the same approach applied to unstructured problems is shown in Sun et al. (2020), A study of vectorization for matrix-free finite element methods.

  • Session 8, video.

    We walked through a roofline analysis. I got a bit confused with the data in the plots from the slides, so we did it live with the BLIS GEMM implementation (see exercise 9). Those are the drawn-in points in the annotated slides.

    I then looked at the application of some of the ideas in the course to the Firedrake finite element system. We applied the dimension-lifted transposition approach to unstructured data (rather than the structured grids that we looked at last time).

    Finally, we tried to build the coursework code to check that it would work when you do it, only to find that it didn’t. I updated the PETSc install on Hamilton and rebuilt things, so I hope that the same approach now works for you. Please get in touch if it doesn’t (or if there are things that you don’t understand).
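
As a reminder of the idea behind the cache blocking for matrix-matrix multiplication revisited in session 6, here is a minimal sketch in C. It is not the code from the slides or exercises: the function name `matmul_blocked`, the helper `min_size`, and the tile size `TILE` are all hypothetical, and the tile size in particular would need tuning for a real machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size: pick it so that three TILE x TILE blocks of
   doubles fit in the cache level you are blocking for. */
enum { TILE = 32 };

size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for row-major n x n matrices, with all three loops tiled.
   The min_size calls handle n that is not a multiple of TILE. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t kk = 0; kk < n; kk += TILE) {
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = min_size(ii + TILE, n);
                size_t k_end = min_size(kk + TILE, n);
                size_t j_end = min_size(jj + TILE, n);
                /* Work on one block of C using one block each of A and B. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        /* Unit-stride inner loop over a row of B and C. */
                        for (size_t j = jj; j < j_end; j++) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The caller is expected to zero `C` first if a plain product is wanted. Whether the compiler vectorises the inner loop is worth checking with the optimisation reports mentioned for session 6.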
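
The arithmetic behind the roofline walkthrough in session 8 can be sketched in a few lines of C. This is only an illustration of the model, assuming a single memory roof and a single compute roof; the function names are hypothetical, and the machine numbers in the note below are made up rather than measured on Hamilton.

```c
#include <assert.h>

/* Arithmetic intensity in FLOPs per byte of memory traffic. */
double arithmetic_intensity(double flops, double bytes)
{
    assert(bytes > 0.0);
    return flops / bytes;
}

/* Roofline bound in GFLOP/s: the lower of the compute roof and the
   memory roof (intensity times bandwidth). */
double roofline_bound(double peak_gflops, double bandwidth_gbs, double intensity)
{
    double memory_roof = intensity * bandwidth_gbs;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Intensity at which the memory roof meets the compute roof. Kernels
   below this value are memory bound in the model. */
double ridge_point(double peak_gflops, double bandwidth_gbs)
{
    assert(bandwidth_gbs > 0.0);
    return peak_gflops / bandwidth_gbs;
}
```

For a hypothetical machine with a 100 GFLOP/s peak and 20 GB/s of memory bandwidth, the ridge point is 5 FLOPs/byte. A kernel with intensity 0.25 FLOPs/byte is then bounded by 5 GFLOP/s, whereas one with intensity 8 FLOPs/byte is bounded by the 100 GFLOP/s compute roof. Plotting a measured point means pairing a measured intensity with a measured GFLOP/s figure and comparing it against this bound.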