Bachelor and Master projects


If you are interested in doing one of the projects below, please contact me!

Master projects

Auto-Tuning for power

Auto-tuning is a well-known optimization technique in computer science. It has been used to ease the manual optimization process that is traditionally performed by programmers, and to maximize performance portability. Auto-tuning works by executing the code that has to be tuned many times on a small problem set, with different tuning parameters. The best-performing version is then used for the real problems. Tuning can be done with application-specific parameters (different algorithms, granularity, convergence heuristics, etc.) or platform parameters (number of parallel threads used, compiler flags, etc.).
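The core loop of such a tuner can be sketched in a few lines of Python. Here, `toy_kernel`, `block_size`, and the candidate values are invented stand-ins for a real tunable kernel and its parameter space:

```python
import itertools
import time

def autotune(run_kernel, search_space):
    """Run the kernel once per configuration and keep the fastest one.

    run_kernel: callable taking a dict of tuning parameters.
    search_space: dict mapping parameter name -> list of candidate values.
    Both are placeholders for the real tunable code and its parameters.
    """
    names = list(search_space)
    best_config, best_time = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        start = time.perf_counter()
        run_kernel(config)  # execute the code on a small problem set
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time

# Toy stand-in for a real tunable kernel: a blocked sum over a small array.
def toy_kernel(config):
    data = list(range(10_000))
    block = config["block_size"]
    total = 0
    for i in range(0, len(data), block):
        total += sum(data[i:i + block])

best, best_time = autotune(toy_kernel, {"block_size": [32, 128, 512]})
```

A real tuner would of course repeat each measurement several times and search far larger parameter spaces, but the structure is the same: enumerate configurations, measure, keep the best.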

For this project, we apply auto-tuning on GPUs. We have several GPU applications where absolute performance is not the most important bottleneck in the real world. Instead, the power dissipation of the total system is critical. This can be due to the enormous scale of the application, or because the application must run on an embedded device. An example of the first is the Square Kilometre Array, a large radio telescope that is currently under construction. With current technology, it would need more power than all of the Netherlands combined. In embedded systems, power usage can be critical as well. For instance, we have GPU codes that compute images for radar systems in drones. There, weight and power limitations are an important bottleneck (batteries are heavy).

In this project, we use power dissipation as the evaluation function of the auto-tuning system. Earlier work by others investigated this, but only for a single compute-bound application. However, many realistic applications are memory-bound. This is a problem, because loading a value from the L1 cache can already take 7-15x more energy than an instruction that only performs a computation (e.g., a multiply).

There are also interesting platform parameters that can be changed in this context. It is possible to change both the core and memory clock frequencies, for instance. It will be interesting to see whether we can achieve the optimal balance between these frequencies at runtime.
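As a sketch of what frequency tuning for energy could look like, the Python below minimizes energy (average power times runtime) over a grid of core/memory clock pairs. `fake_measure` is an invented analytic model of a memory-bound kernel, used only for illustration; a real tuner would take hardware power measurements:

```python
def energy_tune(measure, core_freqs, mem_freqs):
    """Pick the (core, memory) clock pair minimizing energy, not runtime.

    measure(core_mhz, mem_mhz) -> (runtime_s, avg_power_w)
    """
    best, best_energy = None, float("inf")
    for core in core_freqs:
        for mem in mem_freqs:
            runtime, power = measure(core, mem)
            energy = runtime * power  # joules
            if energy < best_energy:
                best, best_energy = (core, mem), energy
    return best, best_energy

# Invented model: runtime dominated by the memory clock (memory-bound),
# power growing with both clocks. Not a real measurement.
def fake_measure(core_mhz, mem_mhz):
    runtime = 1000.0 / mem_mhz + 100.0 / core_mhz
    power = 50.0 + 0.05 * core_mhz + 0.02 * mem_mhz
    return runtime, power

best_pair, joules = energy_tune(fake_measure,
                                core_freqs=[800, 1000, 1200],
                                mem_freqs=[3000, 4000, 5000])
```

The interesting point the sketch makes visible: the fastest configuration (highest clocks) is not necessarily the one with the lowest energy, which is exactly the trade-off this project explores.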

We want to perform auto-tuning on a set of GPU benchmark applications that we developed.

Distributed Spark

Apache Spark is a system for large-scale data processing used in Big Data business applications, but also in many scientific applications. For large and complex scientific applications, different data sets are often combined. For instance, in weather and climate research, we combine data from sensors, remote sensing (satellites), and simulations. In astronomy, we can combine observational data generated by different telescopes, observing in different frequencies. Data sets that must be combined are often distributed over different clusters, institutes and continents.

Spark only has a concept of compute nodes and of data locality within a rack. We want to extend this to hierarchical systems that also include multiple clusters on different continents. However, the high latencies and limited wide-area bandwidth make this challenging. Can we adapt Spark’s runtime system and scheduling algorithms to deal with this?

Bachelor projects

Memory-centric auto-tuning on GPUs

Auto-tuning is a well-known optimization technique in computer science. It has been used to ease the manual optimization process that is traditionally performed by programmers, and to maximize performance portability. Auto-tuning works by executing the code that has to be tuned many times on a small problem set, with different tuning parameters. The best-performing version is then used for the real problems. Tuning can be done with application-specific parameters (different algorithms, granularity, convergence heuristics, etc.) or platform parameters (number of parallel threads used, compiler flags, etc.).

For this project, we apply auto-tuning on GPUs. Here, many applications are not compute-bound, but memory-bound. Most existing auto-tuning research optimizes parameters that are related to compute performance. So, for many applications, the actual bottleneck is left untuned. Therefore, we want to investigate whether we can increase performance by tuning memory-specific parameters, such as how GPU shared memory and registers are used, or whether a data structure is stored as a structure-of-arrays or an array-of-structures. We have many real applications that we can use to test our hypotheses, from astronomy, climate research, digital forensics, imaging, etc.
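To illustrate the layout parameter mentioned above, here is a minimal Python sketch of the same particle data stored as an array-of-structures and as a structure-of-arrays (the field names are invented for illustration). On a GPU, a field-wise sweep touches memory contiguously in the SoA layout but strides over whole records in AoS, which is why this choice matters for memory-bound code:

```python
# Array-of-structures (AoS): one record per particle.
aos = [{"x": float(i), "y": 2.0 * i, "mass": 1.0} for i in range(4)]

# Structure-of-arrays (SoA): one contiguous array per field.
soa = {
    "x":    [float(i) for i in range(4)],
    "y":    [2.0 * i for i in range(4)],
    "mass": [1.0] * 4,
}

# The same field-wise computation in both layouts; results are identical,
# but the memory access patterns differ.
total_aos = sum(p["x"] for p in aos)
total_soa = sum(soa["x"])
assert total_aos == total_soa
```

An auto-tuner can treat the layout as just another tuning parameter: generate both variants of a kernel and measure which one is faster on the target GPU.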

Divide-and-conquer for the Berkeley Dwarfs

Divide-and-conquer is an interesting parallel programming model with very attractive properties, such as good scalability and cache obliviousness. It relies only on splitting the work into several pieces, which are then recursively split again until the sub-problems are small and easy enough to be solved directly. Next, the partial solutions are combined again to arrive at the final solution. One well-known example of this model is MapReduce. We have developed a more generic Java-based system called Satin that can run divide-and-conquer applications on distributed platforms such as clusters, grids and clouds. Satin transparently takes care of communication, work distribution, load balancing, and fault tolerance, making it extremely easy to use. In addition, Satin extends the divide-and-conquer model with shared objects and support for speculative parallelism. More information on Satin can be found here.
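Satin itself is Java-based, but the model can be shown with a language-neutral sketch. This sequential Python version of a divide-and-conquer sum illustrates the three ingredients: split, base case, and combine. In Satin, the two recursive calls would be spawned as parallel tasks and a sync would precede the combine step:

```python
def dc_sum(data, threshold=4):
    """Divide-and-conquer sum: split until the sub-problem is small,
    solve it directly, then combine the partial results."""
    if len(data) <= threshold:          # base case: small enough to solve
        return sum(data)
    mid = len(data) // 2                # split the work into two pieces
    left = dc_sum(data[:mid], threshold)   # spawned in parallel in Satin
    right = dc_sum(data[mid:], threshold)
    return left + right                 # combine step

result = dc_sum(list(range(100)))
```

The threshold controls the task granularity, which is itself a typical tuning parameter: too small and spawning overhead dominates, too large and there is not enough parallelism.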

Our hypothesis is that almost all applications can be written in a divide-and-conquer style. The question, however, is whether they are efficient and scalable in that form. Therefore, we want to investigate different classes of applications to test this. Berkeley came up with the concept of dwarfs: a dwarf is an algorithmic method that captures a pattern of computation and communication.

This is the list of Dwarfs:

Dense Linear Algebra
Sparse Linear Algebra
Spectral Methods
N-Body Methods
Structured Grids
Unstructured Grids
MapReduce
Combinational Logic
Graph Traversal
Dynamic Programming
Backtrack and Branch-and-Bound
Graphical Models
Finite State Machines

We would like to have one or more real applications for each dwarf, implemented sequentially, in Satin, and also in a traditional model such as MPI or Spark (for data-intensive applications). We can then compare and evaluate the effectiveness of the divide-and-conquer model, and see whether the extensions that make Satin more expressive cover all cases.

Fast data serialization and networking for Spark

Apache Spark is a system for large-scale data processing used in Big Data business applications, but also in many scientific applications. Spark uses Java (or Scala) object serialization to transfer data over the network. Especially if the data fits in memory, the performance of serialization is the most important bottleneck in Spark applications. Spark currently offers two mechanisms for serialization: standard Java object serialization and Kryo serialization. In the Ibis project (www.cs.vu.nl/ibis), we have developed an alternative serialization mechanism for high-performance computing applications that relies on compile-time code generation and zero-copy networking for increased performance. Performance of JVM serialization can also be compared with benchmarks: https://github.com/eishay/jvm-serializers/wiki.

We want to evaluate whether we can increase Spark performance at the application level by using our improved object serialization system. In addition, our Ibis implementation can transparently use fast local networks such as InfiniBand. We also want to investigate whether using specialized networks increases application performance. Therefore, this project involves extending Spark with our serialization and networking methods (based on existing libraries), and analyzing the performance of several real-world Spark applications.
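The kind of micro-benchmark used to compare serializers can be sketched as follows. This uses Python's pickle purely as a stand-in for Java, Kryo, or Ibis serialization, and the payload is invented; the point is the measurement structure (round-trip time and serialized size), not the absolute numbers:

```python
import pickle
import time

def time_serialization(obj, rounds=100):
    """Measure round-trip (serialize + deserialize) cost and payload size."""
    start = time.perf_counter()
    for _ in range(rounds):
        blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        restored = pickle.loads(blob)
    elapsed = time.perf_counter() - start
    assert restored == obj  # the round trip must preserve the data
    return len(blob), elapsed / rounds

# Invented payload resembling a Spark record batch.
payload = [{"id": i, "value": i * 0.5} for i in range(1000)]
size_bytes, secs_per_round = time_serialization(payload)
```

In the actual project the comparison would be run at the application level inside Spark, where serializer choice also interacts with shuffle volume and network transport.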

The Auto-Tuning Benchmark

We created a set of benchmark applications in OpenCL that run on many different hardware platforms, such as CPUs, GPUs, the Xeon Phi, etc. We use these codes to test the performance and power efficiency of applications and platforms. In addition, we have an auto-tuning framework that uses a structured approach to perform large-scale auto-tuning. All our OpenCL applications are made tunable: they expose a set of application-specific parameters that can be automatically tuned. However, we would like to have more OpenCL applications, to make sure we have a representative set.

In addition, we would like to implement a general driver program so that users can use our tool as a real benchmark (i.e., you launch a single binary, it tunes everything, and simply outputs the benchmark results).

Here, you can find a list of master projects I supervised in the past.