Parallel Computing

Description: Processor core frequencies have been stagnating for more than a decade. Hence, using multiple cores and computers in parallel has become unavoidable to accelerate a computation. We start this course with a tour of C++, the go-to language for many high-performance frameworks and scientific applications. We then study the class of embarrassingly parallel algorithms, using the C++ standard library (std::thread) and the OpenMP specification. Afterwards, we take a deep dive into the peculiarities and challenges of nondeterminism in shared-memory parallelism. We conclude the course with a focus on message passing using MPI.
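To give a taste of the embarrassingly parallel style covered in the first part, here is a minimal, illustrative sketch (not part of the course material) that sums a vector with std::thread, giving each thread a private accumulator. The variable names and the chunking scheme are illustrative choices, and the OpenMP one-liner in the trailing comment shows the usual reduction idiom for the same computation.

  // Hypothetical sketch: embarrassingly parallel sum with std::thread.
  #include <algorithm>
  #include <cstddef>
  #include <iostream>
  #include <numeric>
  #include <thread>
  #include <vector>

  int main() {
    const std::size_t n = 1'000'000;
    std::vector<double> data(n, 1.0);

    const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> partial(num_threads, 0.0);  // one private accumulator per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
      workers.emplace_back([&, t] {
        const std::size_t begin = static_cast<std::size_t>(t) * chunk;
        const std::size_t end = std::min(n, begin + chunk);
        for (std::size_t i = begin; i < end; ++i)
          partial[t] += data[i];  // each thread writes only its own slot
                                  // (adjacent slots may still share a cache line:
                                  //  false sharing, discussed in the lectures)
      });
    }
    for (auto& w : workers) w.join();

    const double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::cout << "sum = " << total << '\n';  // expect 1e6

    // Roughly equivalent with OpenMP:
    //   #pragma omp parallel for reduction(+ : total)
    //   for (std::size_t i = 0; i < n; ++i) total += data[i];
  }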

This course is given in the Master in High Performance Computing (MHPC) at the University of Luxembourg. On this page, you can access all the recorded videos and laboratories. For MHPC students, the course is self-paced: they unlock videos and labs progressively, and the laboratories are reviewed for correctness, performance, and coding style. Students keep track of their progress here. We assume prior programming experience.

Crash Course in C++

Recorded Lectures

  1. Compile and run your first C++ program
  2. Primitive types
  3. Expressions
  4. Statements
  5. Functions and references
  6. Pointers
  7. Classes
  8. Memory allocation
  9. Operator overloading
  10. Template
  11. Move semantics
  12. Lambda functions
  13. Type inference
  14. Metaprogramming
  15. C language

C++ Laboratories

  1. Cellular Automaton: Rule 110 [pdf]
  2. Vector Data Structure [pdf]
  3. Functional Programming in C++ [pdf]

Group Works

  1. Design a smart pointer [pdf]

Parallel Computing

Recorded Lectures

    Private Read/Write Memory Model
  1. HPC top-down
  2. Benchmarking
  3. Easy acceleration
  4. Arithmetic intensity
  5. Multithreading in theory
  6. Multithreading in practice
  7. Static decomposition
  8. Load imbalance
  9. Fork-join model
  10. Privatization
  11. Synchronization with barrier
  12. OpenMP
  13. False sharing
    Shared Read/Write Memory Model
  14. Why shared read/write memory?
  15. Atomic
  16. Compare-and-swap
  17. Task parallelism
  18. Dynamic task scheduling
  19. Task parallelism with OpenMP
  20. Producer-consumer problem
  21. Dining philosophers problem
    Memory Consistency
  22. Sequential consistency
  23. Litmus tests
  24. x86 Total Store Order (TSO)
  25. Arm/Power Relaxed Memory Consistency
  26. Data-race-free Sequential Consistency
  27. Happens-before Relation

Laboratories

  1. Rule 110: Three Ways to Parallelize [pdf]
  2. Rule 110: Faster [pdf]
  3. Task Scheduler [pdf]
    (from Parallel Computing@Stanford University)
  4. Parallel Interval Propagation [pdf]

Group Works

  1. Parallel algorithm to count atoms [pdf]
  2. Processor architectures [pdf]
    (from Parallel Computing@Stanford University)
  3. Parallel algorithm to count patterns in a string [pdf]
  4. Readers-writers problem [pdf]
  5. Memory consistency [pdf]

Video credits: recorded with Paul Aromolaran, edited by Wei Huang. Thanks to the media center@uni.lu for their assistance in this project!