If you are interested in doing one of the projects below, please contact me!

For this project, we apply auto-tuning on GPUs. We have several GPU applications where the absolute performance is not the most important bottleneck for the application in the real world. Instead, the power dissipation of the total system is critical. This can be due to the enormous scale of the application, or because the application must run in an embedded device. An example of the first is the Square Kilometre Array, a large radio telescope that is currently under construction. With current technology, it would need more power than the entire Netherlands. In embedded systems, power usage can be critical as well. For instance, we have GPU codes that compute images for radar systems in drones. There, weight and power limitations are an important bottleneck (batteries are heavy).

In this project, we use power dissipation as the evaluation function for the auto-tuning system. Earlier work by others investigated this, but only for a single compute-bound application. However, many realistic applications are memory-bound. This matters, because loading a value from the L1 cache can already cost 7-15x more energy than an instruction that only performs a computation (e.g., a multiply).
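As a minimal sketch of what this looks like, the tuner's evaluation function is simply swapped from runtime to energy. Everything below (`measure_time`, `measure_energy`, the coefficients, and the configuration space) is a synthetic stand-in for illustration; a real tuner would benchmark the kernel and sample the GPU's power sensors instead:

```python
def measure_time(block_size):
    # Synthetic runtime model standing in for a real kernel benchmark:
    # larger blocks amortize overhead but eventually cause contention.
    return 1.0 / block_size + 0.00005 * block_size

def measure_energy(block_size):
    # Energy = runtime * average power. The power model is synthetic;
    # a real tuner would read the GPU's power counters while the
    # kernel runs.
    power_watts = 40.0 + 0.5 * block_size
    return measure_time(block_size) * power_watts

def tune(search_space, objective):
    # The tuner itself is unchanged: only the objective is swapped.
    return min(search_space, key=objective)

space = [32, 64, 128, 256]
fastest = tune(space, measure_time)     # optimizes runtime
greenest = tune(space, measure_energy)  # optimizes energy
```

With this toy model the two objectives disagree (the fastest configuration is not the most energy-efficient one), which is precisely the situation that makes energy a worthwhile tuning objective in its own right.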

There are also interesting platform parameters that can be changed in this context. It is possible to change both the core and memory clock frequencies, for instance. It will be interesting to see whether we can achieve the optimal balance between these frequencies at runtime.
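A sketch of such a frequency search, again with a synthetic model of a memory-bound kernel (the `energy_model` coefficients and the listed clock steps are illustrative; the real supported frequency steps and measurements come from the driver):

```python
def energy_model(core_mhz, mem_mhz):
    # Synthetic model of a memory-bound kernel: runtime is dominated
    # by the memory clock, while power grows with both clocks.
    runtime = 1.0 / mem_mhz + 0.1 / core_mhz
    power = 0.05 * core_mhz + 0.03 * mem_mhz + 20.0
    return runtime * power

core_freqs = [600, 900, 1200, 1500]  # illustrative clock steps (MHz)
mem_freqs = [800, 1100, 1400]

best = min(((c, m) for c in core_freqs for m in mem_freqs),
           key=lambda cm: energy_model(*cm))
```

Under this toy model the search settles on a high memory clock combined with a low core clock, which matches the intuition for a memory-bound kernel: the core is mostly waiting, so running it faster only burns power.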

We want to perform auto-tuning on a set of GPU benchmark applications that we developed.

Spark only has a concept of compute nodes, and of data locality within a rack. We want to extend this to hierarchical systems that also include multiple clusters on different continents. However, the high latencies and limited wide-area bandwidth make this challenging. Can we adapt Spark's runtime system and scheduling algorithms to deal with this?
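As a sketch of the extra hierarchy levels such a scheduler would have to reason about, the snippet below models locality as the shared prefix of a (continent, cluster, rack, node) tuple and makes crossing each level an order of magnitude more expensive. The cost values and host names are made up for illustration and have nothing to do with Spark's actual internals:

```python
# A host's position in the hierarchy: (continent, cluster, rack, node).
def transfer_cost(src, dst):
    # Cost grows an order of magnitude per hierarchy level crossed:
    # same node < same rack < same cluster < same continent < WAN.
    shared = 0
    for a, b in zip(src, dst):
        if a != b:
            break
        shared += 1
    return 10 ** (len(src) - shared)

def schedule(data_host, executors):
    """Pick the executor with the cheapest path to the task's input data."""
    return min(executors, key=lambda e: transfer_cost(data_host, e))

executors = [
    ("eu", "ams", "r1", "n3"),  # same rack as the data
    ("eu", "ams", "r2", "n1"),  # same cluster, different rack
    ("us", "nyc", "r1", "n1"),  # different continent
]
data = ("eu", "ams", "r1", "n7")
chosen = schedule(data, executors)
```

Even this crude cost model makes the wide-area problem visible: a continent crossing is three orders of magnitude more expensive than staying in the rack, so a scheduler that only knows about racks will make very bad placement decisions at the top of the hierarchy.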

For this project, we apply auto-tuning on GPUs. Here, many applications are not compute-bound, but memory-bound. Most existing auto-tuning research optimizes parameters that are related to compute performance. So, for many applications, the actual bottleneck is not optimized at all. Therefore, we want to investigate whether we can increase performance by tuning memory-specific parameters, such as how GPU shared memory and registers are used, or whether a data structure is stored as a structure-of-arrays or an array-of-structures. We have many real applications that we can use to test our hypotheses, from astronomy, climate research, digital forensics, imaging, etc.
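To make the structure-of-arrays versus array-of-structures distinction concrete, here is a small byte-level sketch (the particle example is illustrative; on a GPU the point is that the SoA layout lets a kernel reading only one field access consecutive, coalescable memory):

```python
import struct

# Array-of-structures: each particle's (x, y, z) packed together.
def pack_aos(particles):
    return b"".join(struct.pack("fff", *p) for p in particles)

# Structure-of-arrays: all x values contiguous, then all y, then all z,
# so a kernel that only reads x streams through consecutive memory.
def pack_soa(particles):
    xs, ys, zs = zip(*particles)
    n = len(particles)
    return (struct.pack(f"{n}f", *xs) +
            struct.pack(f"{n}f", *ys) +
            struct.pack(f"{n}f", *zs))

particles = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
aos = pack_aos(particles)
soa = pack_soa(particles)
# In AoS, consecutive x values are 12 bytes apart; in SoA, 4 bytes.
```

Which layout wins depends on the access pattern, which is exactly why it is a natural tuning parameter rather than a fixed design decision.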

Our hypothesis is that almost all applications can be written in a divide-and-conquer style. The question, however, is whether they are efficient and scalable in that form. Therefore, we want to investigate different classes of applications to test this. Researchers at UC Berkeley came up with the concept of dwarfs: a dwarf is an algorithmic method that captures a pattern of computation and communication.

This is the list of Dwarfs:

Dense Linear Algebra

Sparse Linear Algebra

Spectral Methods

N-Body Methods

Structured Grids

Unstructured Grids

MapReduce

Combinational Logic

Graph Traversal

Dynamic Programming

Backtrack and Branch-and-Bound

Graphical Models

Finite State Machines

We would like to have one or more real applications for each dwarf, implemented sequentially, in Satin, and also in a traditional model, such as MPI or Spark (for data-intensive applications). We can then compare and evaluate the effectiveness of the divide-and-conquer model, and see whether the extensions that make Satin more expressive cover all cases.
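As a language-neutral sketch of the style in question (Satin itself is Java-based; the function below only illustrates the spawn/sync pattern and is not Satin code):

```python
def vector_sum(data, lo, hi, threshold=4):
    """Divide-and-conquer sum in the style Satin targets: the two
    recursive calls are the points where a runtime could spawn
    parallel subtasks, and combining their results is a sync."""
    if hi - lo <= threshold:
        return sum(data[lo:hi])           # sequential leaf
    mid = (lo + hi) // 2
    left = vector_sum(data, lo, mid)      # spawnable subtask
    right = vector_sum(data, mid, hi)     # spawnable subtask
    return left + right                   # sync point, then combine

total = vector_sum(list(range(100)), 0, 100)
```

The open question for each dwarf is whether its natural formulation decomposes this cleanly; for dense linear algebra or N-body methods it plausibly does, while for, say, unstructured grids or graph traversal the recursive split is much less obvious.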

In addition, we would like to implement a general program, so that users can run our tool as a real benchmark (i.e., you launch a single binary, it tunes everything, and simply outputs the benchmark results).

Here, you can find a list of master projects I supervised in the past.