Embedded Systems Architecture: Research Seminar

# Research Seminar

This page is under construction and only a subset of the seminars is listed for now.

## 25.04.2019-10:00: "Design and implementation of parallel architecture for sobel edge detection algorithm for real-time applications". Javad Bahrami.

Edge detection is one of the fundamental and most important problems of low-level image processing. Among the various edge detection algorithms, the Sobel algorithm is one of the most practical candidates because it is less sensitive to noise than simple gradient operators and is easy to implement. There are many different approaches to implementing the Sobel edge detection algorithm; field-programmable gate arrays (FPGAs) are among the best candidates for implementing image processing algorithms due to their flexibility and the parallel nature of hardware. In this presentation, the Sobel edge detector algorithm is described briefly, and afterwards a pipelined architecture that uses memory efficiently for its implementation is introduced.
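As a point of reference for the algorithm named in this abstract, the following is a minimal software sketch of the Sobel operator in plain Python. It is not the pipelined FPGA architecture from the talk; the function name `sobel` and the use of `|Gx| + |Gy|` as the gradient magnitude (a common hardware-friendly approximation) are assumptions made for illustration.

```python
# Standard 3x3 Sobel kernels for the horizontal (GX) and vertical (GY) gradient.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel(image):
    """Return |Gx| + |Gy| for every interior pixel; border pixels stay 0."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            # Accumulate both gradients over the 3x3 neighbourhood.
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += GX[dy][dx] * pixel
                    gy += GY[dy][dx] * pixel
            out[y][x] = abs(gx) + abs(gy)
    return out


# A 4x4 test image with a vertical step edge between columns 1 and 2.
step = [[0, 0, 10, 10] for _ in range(4)]
edges = sobel(step)
```

Every output pixel depends only on its 3x3 neighbourhood, which is what makes the operator a natural fit for line-buffered, pipelined hardware.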

## 18.04.2019-10:00: "Polyhedral Perforation for Automatic Approximation Modeling". Daniel Maier.

Polyhedral compilation has been shown by many researchers to be the most effective way to reason about loops, their optimization, and parallelization.

These transformations always preserve the semantics of the application. However, there is an emerging class of approximate applications that trade some accuracy for higher performance, e.g., speedup or reduced energy consumption. We present a novel way to perform and model loop perforation and data perforation with an instruction-based perforation granularity for approximable applications using the polyhedral framework. We extend the polyhedral model and provide the tools needed to perform perforation. We present models for accuracy, performance, and energy.
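To illustrate the basic idea of loop perforation mentioned in this abstract, here is a small sketch in plain Python. It shows only the generic technique of skipping loop iterations, not the polyhedral, instruction-granular method of the talk; the function names and the input data are hypothetical.

```python
def mean_exact(values):
    """Reference result: visit every element."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)


def mean_perforated(values, stride):
    """Approximate result: execute only every `stride`-th loop iteration."""
    total = 0.0
    count = 0
    for i in range(0, len(values), stride):  # perforated loop
        total += values[i]
        count += 1
    return total / count


data = [float(i) for i in range(100)]
exact = mean_exact(data)                   # 100 iterations
approx = mean_perforated(data, stride=4)   # 25 iterations
relative_error = abs(approx - exact) / exact
```

In this example the perforated loop does a quarter of the work at the cost of a relative error of about 3%, which is the kind of accuracy/performance trade-off that the models in the talk aim to predict.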

## 11.04.2019-10:00: "An Efficient Lightweight Framework for Porting of Vision Algorithms on Embedded SoC". Apurv Ashish.

The introduction of digitalization, or more precisely computer vision, has changed the entire landscape of the automotive industry and made autonomous vehicles a reality. The basic requirement for such an automobile is an intelligent system that can process, analyse and extract information from sensor inputs such as a camera or odometer. In the case of a camera as an input sensor, the continuous camera frames are processed on onboard embedded hardware to extract features and relevant information. The vision algorithms used for processing camera inputs are computation and memory intensive.

Because vision algorithms are initially developed for PC architectures, the stringent restrictions of embedded devices, such as low on-chip memory and DMA operations, are not addressed. These issues are handled when porting the algorithm to an embedded device. For instance, TI’s (Texas Instruments) embedded platforms use the BAM (Block Accelerator Manager) framework for integrating algorithms on embedded hardware. Being generic in nature, the framework has some notable drawbacks, which result in higher stack memory usage, longer execution time, considerable porting time and a largely incomprehensible codebase. In this thesis, we propose a new framework for TI hardware, Efficient-BAM (E-BAM), which results in a speedup of about 14.8%, lower stack usage (88%) and an easily reconfigurable codebase. Further, a code generator is implemented that exploits the modular nature of the E-BAM framework, thereby reducing the porting time of algorithms on embedded devices by a significant factor.

## 14.03.2019-10:30: "Presentation of the H2020 project FORTIKA". Dr. Yacine Rebahi.

In our talk we will present the H2020 project FORTIKA, which aims at developing a robust, resilient and effective cybersecurity solution that combines hardware and software and can be effortlessly tailored to each individual enterprise’s evolving needs. One of the main issues addressed in this project is how to improve the performance of this solution. Our work focusses on devising an intrusion prevention architecture called FORTISEC (40SEC), which is meant to operate in a completely softwarized mode as well as in an FPGA-accelerated mode. We will present suitable algorithms and components for the implementation of accelerated intrusion prevention. We will also present a testbed used for the implementation of 40SEC and its performance testing.

## 07.03.2019-15:30: "Validation the Power Measurement Accuracy of NVIDIA Management Library (NVML)". Kaijie Fan.

Energy and power consumption are becoming critical metrics in the design of high-performance systems. As a result, energy-efficient computing on GPUs has drawn more and more attention from researchers. Improving energy efficiency requires accurate power measurements, which can be obtained using different approaches. NVML provides a way to measure power without any additional hardware. However, the accuracy of its power readings on Maxwell, Tesla and Pascal is not known, and validating it is the subject of my current work. Since the work is still in progress, I will briefly introduce the motivation and methodology, then discuss current progress and future work.

## 28.02.2019-10:00: "Introduction to the Current Project". Janik Zimmermann.

This is a startup presentation for a project I'm working on with Prof. Dr. Juurlink. The aim of the project is to evaluate a digital signal processing platform for Sundance DSP Inc. As such, what is to be implemented may be chosen quite freely. The current idea is based on the 2014 paper “Riesz Pyramids for Fast Phase-Based Video Magnification” (Neal Wadhwa, Michael Rubinstein, Frédo Durand, William T. Freeman). In it, previous phase-based motion amplification techniques from 2013 are improved upon, making the first real-time implementation possible. Their existing C++ implementation allows them to amplify a 640x400 pixel video at 25 frames per second on a decent computer. However, this is done using further downsampling of the image. Other than that, their program appears poorly optimized, using only a single CPU core. The goal of my project is to implement this algorithm using as much hardware acceleration as possible, allowing us to increase the resolution to something more enjoyable. This will be implemented by modifying an existing demo by Sundance, which shows a simple edge detection algorithm applied to a live video feed.

## 14.02.2019-10:00: "JIT Compilation of SPIR-V using LLVM". Moritz Patelscheck.

The recent introduction of the Vulkan API along with SPIR-V, an intermediate language for native representation of graphical shaders and compute kernels, promises cross-platform access, high efficiency and better performance of GPU applications. Being able to execute SPIR-V on a CPU is needed by simulators and emulators of these APIs. However, currently there is no such support. In this work, we develop a Just In Time (JIT) compiler using the LLVM framework that is able to execute SPIR-V on a CPU instead of a GPU. SPIR-V instructions are converted to their LLVM IR counterparts and executed on the fly. In this midterm presentation, I’ll briefly introduce SPIR-V, describe our JIT compiler and discuss current progress and future work.

## 07.02.2019-10:00: "Predictable Thread Coarsening". Nicolai Stawinoga.

Thread coarsening on GPUs combines the work of several threads into one. We demonstrate how thread coarsening can be implemented as a fully automated compile-time optimisation. Our approach estimates the optimal coarsening factor based on a low-cost, approximate static analysis of cache line re-use and an occupancy prediction model. We evaluate two coarsening strategies on three different NVidia GPU architectures. For NVidia reduction kernels we achieve a maximum speedup of 5.08x, and for the Rodinia benchmarks we achieve a mean speedup of 1.30x over 8 of 19 kernels that were determined safe to coarsen.
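The index transformation behind thread coarsening can be sketched as a sequential Python simulation. This is only an illustration of the general idea, not the compile-time optimisation or the prediction model from the talk; the helper `launch`, the kernel names and the strided work assignment are assumptions.

```python
def vector_add_kernel(thread_id, a, b, out):
    """Baseline kernel: each (simulated) thread handles one element."""
    out[thread_id] = a[thread_id] + b[thread_id]


def vector_add_coarsened(thread_id, a, b, out, factor, num_threads):
    """Coarsened kernel: each thread does the work of `factor` original threads.

    Elements are assigned with a stride of `num_threads`, so neighbouring
    threads still access neighbouring elements in each step.
    """
    for k in range(factor):
        i = thread_id + k * num_threads
        if i < len(out):  # guard for sizes not divisible by the factor
            out[i] = a[i] + b[i]


def launch(kernel, num_threads, *args):
    """Sequentially simulate launching `num_threads` GPU threads."""
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


n = 10
a = list(range(n))
b = [2 * x for x in a]

baseline = [0] * n
launch(vector_add_kernel, n, a, b, baseline)

factor = 4
num_threads = -(-n // factor)  # ceiling division: 3 threads instead of 10
coarsened = [0] * n
launch(vector_add_coarsened, num_threads, a, b, coarsened, factor, num_threads)
```

Both launches produce the same result; the coarsened version uses fewer threads with more work each, and the coarsening factor is the parameter whose best value the approach in the talk estimates.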

## 31.01.2019-10:00: "Overview of FPGA-based Deep Neural Network techniques". Anastasiia Dolinina.

The main advantage of FPGAs is the ability to build reconfigurable solutions. In this overview, I will talk about FPGA technology for the implementation of Deep Neural Networks (DNNs). I will then discuss the place of FPGA platforms among other solutions (CPU, GPU, DSP, etc.) and the advantages and disadvantages of using FPGAs. Later, I will show that one of the main obstacles to using FPGAs is the limited amount of their internal resources, which limits processing throughput. I will also consider the basic ways of optimizing throughput: weight encoding, batch processing and weight pruning. In the conclusion, I will discuss my proposal for throughput optimization.
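Of the three throughput optimizations listed in this abstract, weight pruning is the easiest to show in a few lines. The sketch below uses magnitude-based pruning, which is one common criterion and an assumption here; the abstract does not say which criterion the talk considers. The function name and the weight values are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    num_pruned = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_pruned]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
sparse = prune_by_magnitude(weights, sparsity=0.5)
```

Zeroed weights need neither storage nor multiply-accumulate operations, which is why pruning helps when on-chip resources are the limiting factor.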

## 24.01.2019-10:00: "Performance Counters based GPU Power Modeling using Machine Learning". Markus Neu.

GPUs have become important computational units on mobile and embedded devices. As power consumption is considered a critical factor on these devices, it is more and more important to be able to estimate and understand how GPU applications behave in terms of power.

In this thesis, a neural network (NN) was used to approximate the power consumption of an embedded device. The first step was to collect usable data from the given device, based on the device's own hardware counters and externally measured power consumption. A small NN was then created, which was repeatedly evaluated and altered to achieve the lowest error.

## 17.01.2019-10:00: "Understanding Sources of Inefficiency in General-Purpose Chips (R. Hameed et al., ISCA 2010)". Philipp Habermann.

ASICs are known to be the most energy-efficient computing systems for specific applications. General-purpose chips are more flexible but fall behind by orders of magnitude in performance and energy consumption. But what are the reasons for this? In this talk, the results of a widely known paper from ISCA 2010 will be presented to provide a quantitative analysis of the inefficiencies of general-purpose chips. It will also be shown how incremental customization of the processor makes it possible to reach ASIC performance within 3x of its energy for an H.264/AVC encoder.

## 10.01.2019-10:00: "An Energy-Aware Prediction Methodology for Achievable Bandwidth of Heterogeneous Memory Architectures on FPGA-SoCs". Matthias Goebel.

The trend of using heterogeneous computing and HW/SW co-design approaches allows increasing performance significantly while reducing power consumption. One of the main challenges when combining multiple processing devices is communication, as an inefficient communication configuration can become a bottleneck for overall system performance. In this talk, I will first give a recap of what has been done before, where we proposed a methodology for making good design decisions for FPGA-SoC systems using shared DDR memory for communication. Afterwards, I will present some extensions that I’m currently working on, such as extending the methodology to heterogeneous memory architectures, energy measurement capabilities and support for determining the best set of hardware accelerators.

## 20.12.2018-10:00: "An Efficient and Light Framework for Porting of Vision Algorithms on Embedded Hardware". Apurv Ashish.

The introduction of digitalisation, or more precisely computer vision, has changed the entire landscape of the automobile industry and made autonomous vehicles a reality. The basic requirement for such an automobile is an intelligent system that can process, analyse and extract information from sensor inputs (such as a camera or odometer). In the case of the camera as an input sensor, the continuous camera frames are processed on onboard embedded hardware to extract features and relevant information. The vision algorithms used for processing camera inputs are computation and memory intensive. Because vision algorithms are initially developed for PC architectures, the stringent restrictions of embedded devices, such as low on-chip memory and DMA operations, are not addressed. These issues are handled when porting the algorithm to an embedded device. For instance, TI’s (Texas Instruments) embedded platforms use the BAM (Block Accelerator Manager) framework for integrating algorithms on embedded hardware. Being generic in nature, the framework has some notable drawbacks, which result in higher stack memory usage, longer execution time and a largely incomprehensible codebase. In this thesis, we propose a new framework for TI hardware, which results in a speedup of about 18%, lower stack usage (88%) and an easily reconfigurable codebase. In the future, a code generator will be implemented that exploits the modular nature of the Non-BAM framework, thereby reducing the porting time of algorithms on embedded devices by a significant factor.

## 13.12.2018-10:00: "Operating System Support for Stream Processing on SoC/FPGA Hybrid ICs". Robert Drehmel.

Stream processing systems -- systems consisting of autonomous processing modules that communicate using streams of data -- are inherently of parallel nature.

In theory, the highly integrated SoC/FPGA hybrid ICs (e.g. the MPSoC series from Xilinx) that recently hit the markets should be good candidates to use the stream processing paradigm, due to the wide variety of on-chip, heterogeneous processing elements and peripherals, and the possibility for the designer to instantiate custom processing pipelines in the programmable logic fabric of the IC. In practice, sufficient tools and work flows (and possibly more importantly operating system support) to exploit stream processing on SoC/FPGAs and to close the performance / productivity gap are still not available.

In this presentation, I will give an overview of my PhD thesis research that explores combining stream processing with modern SoC/FPGA hybrids.

## 07.12.2018-10:15: "Accelerator Architectures for Deep Learning: Efficiency vs Flexibility?". Prof. Dr. Henk Corporaal.

• Title: Accelerator Architectures for Deep Learning: Efficiency vs Flexibility?
• Presenter: Prof. Dr. Henk Corporaal (TU Eindhoven, Netherlands)
• Date and time: Friday December 7, 2018 - 10:15h
• Room: EN 644

Abstract:
Deep Learning and Convolutional Neural Networks (CNNs) have revolutionized important domains like machine learning and computer vision. The huge success of deep learning accelerates research in that particular domain, and as a result the complexity and diversity of state-of-the-art network models have increased significantly. This opens several challenges for CNN accelerator designers. During this presentation we will go over the state-of-the-art networks and accelerator solutions. We will present our view on compute efficiency and flexibility. We demonstrate that the well-known optimization techniques from the computing industry are key to improving efficiency. We present a holistic approach that improves efficiency by algorithmic optimizations, data reuse improvements, custom accelerators, and the less obvious but very important challenges in code generation.

Bio:
Henk Corporaal is Professor in Embedded System Architectures at the Eindhoven University of Technology (TU/e) in The Netherlands. He has gained an MSc in Theoretical Physics from the University of Groningen, and a PhD in Electrical Engineering, in the area of Computer Architecture, from Delft University of Technology.
Corporaal has co-authored over 350 journal and conference papers. Furthermore, he invented a new class of VLIW architectures, the Transport Triggered Architectures, which are used in several commercial products and by many research groups.
His research is on low-power multi-processor and heterogeneous processing architectures, their programmability, and the predictable design of soft and hard real-time systems. This includes research and design of embedded system architectures, accelerators, GPUs, the exploitation of all kinds of parallelism, fault tolerance, approximate computing, architectures for machine and deep learning, and the (semi-)automated mapping of applications to these architectures. For further details see corporaal.org.

## 06.12.2018-10:30: "Improving performance by efficient prefetching for GDDR5X". Farzaneh Salehiminapour.

In this talk, I will discuss initial results obtained from a custom simulation, and then I will present the next steps to be taken in this project.

Over the last decade, applications once exclusive to high performance computing are now common in systems ranging from mobile devices to clusters. They typically require large amounts of memory bandwidth. The graphic DRAM interface standard GDDR5X is a new technology that promises almost doubled data rates compared to GDDR5. However, these higher data rates are only possible with a longer burst length of 16 words. This would typically increase the memory access granularity. Moreover, GDDR5X supports an interesting feature called pseudo channel mode. In pseudo channel mode, the memory is split into two 16-bit pseudo channels. This split keeps the memory access granularity constant compared to GDDR5. However, the pseudo channels are not fully independent channels, but access type, bank and page must match. With this restriction, we argue that GDDR5X can best be seen as a GDDR5 memory that allows performing an additional request to the same page without extra cost. Therefore, we aim to develop a DRAM prefetching technique to make effective use of the pseudo channel mode and the additional memory bandwidth offered by GDDR5X.
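The granularity argument in this abstract can be checked with a short calculation. The sketch below assumes a 32-bit channel and a burst length of 8 for GDDR5; these two values are not stated in the abstract. The burst length of 16 and the two 16-bit pseudo channels are taken from the text, and the function name is hypothetical.

```python
def access_granularity_bytes(channel_width_bits, burst_length):
    """Bytes transferred by one burst on a channel of the given width."""
    return channel_width_bits * burst_length // 8


# Assumed GDDR5 baseline: 32-bit channel, burst length 8.
gddr5 = access_granularity_bytes(32, 8)

# GDDR5X without pseudo channels: same width, burst length 16.
gddr5x = access_granularity_bytes(32, 16)

# GDDR5X pseudo channel mode: two 16-bit pseudo channels, burst length 16.
gddr5x_pseudo_channel = access_granularity_bytes(16, 16)
```

Under these assumptions the longer burst doubles the granularity from 32 to 64 bytes, while each 16-bit pseudo channel returns to 32 bytes, matching the statement that pseudo channel mode keeps the granularity constant compared to GDDR5.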

## 06.12.2018-10:00: "Cost Modeling for Auto-Vectorizers and an Introduction to Scalable Vector Extensions". Angela Pohl.

The first part of this report will summarize the findings of the research about cost modeling for vectorization, which is currently under revision at CC (Compiler Construction).

The second part will provide an introduction to the latest vector extensions by ARM, the Scalable Vector Extensions (SVE). This new instruction set architecture allows writing vector-size-agnostic code to create more flexible, yet performant programs. Besides introducing the new features of SVE, the talk will also discuss research opportunities to explore the limits of parallelism in applications and highlight the strategy of one of ARM's biggest competitors, RISC-V.

## 29.11.2018-10:30: "GPU Power Modeling and Architectural Enhancements for GPU Energy Efficiency". Jan Lucas.

Graphics Processing Units (GPUs) can now be found in nearly every PC and smartphone. Designed for 3D graphics, they evolved into general purpose accelerators, able to outperform CPUs on many tasks. GPU performance is limited by power consumption. Better energy efficiency enables higher performance.

To increase the energy efficiency, we measure and model the energy consumption of existing GPUs. A custom GPU power measurement infrastructure and an architectural power simulator called GPUSimPow are described and evaluated. Regular architectural simulators do not model the data-dependent energy consumption but assume that energy consumption per operation does not depend on the data. We show that this assumption is not true, but that GPU power consumption can vary by more than 60% with different data and present two data-dependent power models.

Afterwards, we focus on architectural enhancements to improve the energy efficiency. Two techniques focus on the memory interface: A novel approximation technique reduces the DRAM refresh energy and an optimized encoding scheme reduces the power consumption of the external interface between GPU and DRAM by up to 6%. We continue with enhancements to improve the energy efficiency of the GPU cores. We evaluate an alternative to the conventional SIMT GPU architecture called temporal SIMT (TSIMT) and extend it to spatiotemporal SIMT. Scalarization can be used to remove redundant operations. We show that spatiotemporal SIMT with Scalarization improves the energy-delay product by 26.2% compared to conventional GPUs.

## 22.11.2018-10:15: "Two Roads to Parallelism: Compilers and Libraries". Prof. Dr. Lawrence Rauchwerger (Texas A&M University, USA).

• Title: Two Roads to Parallelism: Compilers and Libraries
• Presenter: Prof. Dr. Lawrence Rauchwerger (Texas A&M University, USA)
• Date: November 22, 2018
• Time: 10:15
• Room: EN 644

Abstract:
Parallel computers have come of age and need parallel software to justify their usefulness. There are two major avenues to get programs to run in parallel: parallelizing compilers and parallel languages and/or libraries. In this talk we present our latest results using both approaches and draw some conclusions about their relative effectiveness and potential.
In the first part we introduce the Hybrid Analysis (HA) compiler framework that can seamlessly integrate static and run-time analysis of memory references into a single framework capable of full automatic loop level parallelization. Experimental results on 26 benchmarks show full program speedups superior to those obtained by the Intel Fortran compilers.
In the second part of this talk we present the Standard Template Adaptive Parallel Library (STAPL) based approach to parallelizing code. STAPL is a collection of generic data structures and algorithms that provides a high productivity, parallel programming infrastructure analogous to the C++ Standard Template Library (STL). In this talk, we provide an overview of the major STAPL components with particular emphasis on graph algorithms. We then present scalability results of real codes using peta scale machines such as IBM BG/Q and Cray. Finally we present some of our ideas for future work in this area.

Bio:
Lawrence Rauchwerger is the Eppright Professor of Computer Science and Engineering at Texas A&M University and the co-Director of the Parasol Lab. He is currently a visiting professor at ETH Zurich and will be joining the University of Illinois at Urbana-Champaign in the Fall of 2019. He received a Dipl. Engineer degree from the Polytechnic Institute Bucharest, an M.S. in Electrical Engineering from Stanford University and a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. He has held Visiting Faculty positions at the University of Illinois, Bell Labs, IBM T.J. Watson, and INRIA, Paris. Rauchwerger's approach to auto-parallelization, thread-level speculation and parallel code development has influenced industrial products at corporations such as IBM, Intel and Sun. Rauchwerger is an IEEE Fellow, an NSF CAREER award recipient and has chaired various IEEE and ACM conferences, most recently serving as Program Chair of PACT 2016 and PPoPP 2017.

## 15.11.2018-10:00: "Predictable GPU Frequency Scaling for Energy and Performance". Kaijie Fan.

Dynamic voltage and frequency scaling (DVFS) is an important solution to balance performance and energy consumption, and hardware vendors provide management libraries to change both memory and core frequencies. However, it is not straightforward to find the best frequency configuration by setting these frequencies manually. In this talk, I will present our approach, which is based on machine learning and uses static code features, to predict the best core and memory frequency configurations on a GPU for an input OpenCL kernel. Furthermore, I will talk about the improvements over my previous paper. Test results show that our modeling approach is accurate in predicting extrema points and the Pareto-optimal set for ten out of twelve test benchmarks.
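The notion of a Pareto-optimal set of frequency configurations can be made concrete with a small sketch. It shows only how such a set is extracted from known energy and speedup values, not the machine-learning predictor from the talk. The function name, the dictionary keys and all numbers below are hypothetical and purely illustrative.

```python
def pareto_front(configs):
    """Return the configurations not dominated by any other configuration.

    A configuration dominates another if it has lower or equal energy and
    higher or equal speedup, and is strictly better in at least one of them.
    """
    front = []
    for c in configs:
        dominated = any(
            o["energy"] <= c["energy"]
            and o["speedup"] >= c["speedup"]
            and (o["energy"] < c["energy"] or o["speedup"] > c["speedup"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front


# Hypothetical (core, memory) frequency settings with normalized
# energy and speedup; these are not measurements.
configs = [
    {"core_mhz": 600, "mem_mhz": 800, "energy": 1.00, "speedup": 1.00},
    {"core_mhz": 900, "mem_mhz": 800, "energy": 0.90, "speedup": 1.30},
    {"core_mhz": 1200, "mem_mhz": 800, "energy": 1.10, "speedup": 1.50},
    {"core_mhz": 1200, "mem_mhz": 400, "energy": 1.20, "speedup": 1.20},
]
front = pareto_front(configs)
```

With these example values, only the 900/800 and 1200/800 settings remain: each of the other two is beaten on both energy and speedup by the 900/800 setting.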

## 08.11.2018-10:00: "Evaluating the Memory Architecture of the Zynq Ultrascale+ FPGA-SoC". Kai Classen.

The UltraScale+ platform will be presented briefly, with a focus on the interfaces that connect the Processing System with the Programmable Logic on the FPGA-SoC. Two results regarding the maximum bandwidth will be shown as examples: the read benchmark results from the non-cache-coherent HP port and the read results from the cache-coherent HPC port, which will be compared with a software baseline. After presenting the evaluation results, conclusions will be drawn, giving specific port recommendations for different use cases.

## 01.11.2018-10:00: "SLC: Memory Access Granularity Aware Selective Lossy Compression for GPUs". Sohan Lal.

Memory compression is a promising approach for reducing memory bandwidth requirements and increasing performance; however, memory compression techniques often result in a low effective compression ratio due to the large memory access granularity (MAG) exhibited by GPUs. In this talk, I will present our analysis of the distribution of compressed blocks, which shows that a significant percentage of blocks are compressed to a size that is only a few bytes above a multiple of MAG, but a whole burst is fetched from memory. These few extra bytes significantly reduce the compression ratio and the performance gain that otherwise could result from a higher raw compression ratio. To increase the effective compression ratio, we propose a novel MAG-aware Selective Lossy Compression (SLC) technique for GPUs. The key idea of SLC is that when lossless compression yields a compressed size a few bytes above a multiple of MAG, we approximate these extra bytes such that the compressed size is a multiple of MAG. This way, SLC mostly retains the quality of lossless compression and occasionally trades a small amount of accuracy for higher performance. We show a speedup of up to 35% normalized to a state-of-the-art lossless compression technique with a low loss in accuracy. Furthermore, average energy consumption and energy-delay-product are reduced by 8.3% and 17.5%, respectively.
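The effect of memory access granularity on the effective compression ratio, and the decision rule described above, can be sketched as follows. The MAG of 32 bytes, the block size of 128 bytes, the threshold parameter and the function names are assumptions chosen for illustration; the abstract does not state these values, and the actual approximation of the extra bytes is not modeled.

```python
MAG = 32          # memory access granularity in bytes (assumed value)
BLOCK_SIZE = 128  # uncompressed block size in bytes (assumed value)


def bursts_needed(compressed_size, mag=MAG):
    """Number of MAG-sized bursts fetched for a compressed block."""
    return -(-compressed_size // mag)  # ceiling division


def slc_size(compressed_size, threshold, mag=MAG):
    """Compressed size after selectively approximating a few extra bytes.

    If the size is at most `threshold` bytes above a multiple of MAG, the
    extra bytes are approximated (lossy) so the size becomes that multiple.
    Otherwise the lossless size is kept.
    """
    extra = compressed_size % mag
    if compressed_size > mag and 0 < extra <= threshold:
        return compressed_size - extra
    return compressed_size


lossless = 68  # hypothetical lossless result: 4 bytes above 2 * MAG
lossy = slc_size(lossless, threshold=4)

ratio_lossless = BLOCK_SIZE / (bursts_needed(lossless) * MAG)
ratio_slc = BLOCK_SIZE / (bursts_needed(lossy) * MAG)
```

In this example the four extra bytes force a third burst, so the effective ratio drops to about 1.33 although the raw ratio is close to 1.9; after approximating them, two bursts suffice and the effective ratio is 2.0.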

## 25.10.2018-10:00: "Domain-specific Approximation of Neural Networks". Thomas Hartenstein.

Deep neural networks have had unprecedented success in all kinds of fields, e.g., in image classification. As mobile applications of these machine learning algorithms are becoming more important, the size of deep neural networks is increasingly becoming a target of optimization. Approximation techniques can be used for such optimizations, as neural networks are inherently tolerant to some error. In this work we propose new pruning techniques that can be applied to larger and deeper neural networks to make it feasible to run them on mobile devices while maintaining acceptable levels of accuracy. In this mid-term presentation we give an overview of our work and show some preliminary results.

## 18.10.2018-10:00: "Accurate Feature Representation in Predictive Modeling using Cost Relation Functions". Nadjib Mammeri.

Heterogeneous computing has recently become mainstream. Although it provides ample opportunities for exploiting parallelism, it also introduces many challenges such as task mapping and partitioning. One approach to overcome such challenges is the use of predictive modeling. Machine-learning-based models are trained and used at runtime to make informed decisions, optimizing for either performance or power. Feature selection and encoding are essential to the effectiveness of such models. In this work we present a novel feature encoding technique using cost relation functions. In this presentation, I'll describe the motivation behind this work, show my progress so far and highlight future work.

## 02.10.2018-10:00: "ASIPs in system on-chip design: Impressions from ASIP event". Farzaneh Salehiminapour.

Recently, I attended a one-day Synopsys event on the topic of "ASIPs in system on-chip design". In this week's talk, I am going to present a summary of this workshop. My presentation will focus on the following three topics: (1) application-specific instruction set processors in system on-chip design, (2) design of ASIPs for deep learning acceleration, and (3) design of an ASIP MIMO MME equalizer for 5G New Radio using ASIP Designer. Unfortunately, the organizers have not shared their slides yet, but if you are interested, I can share them as soon as I receive them.

## 11.09.2018-10:00: "Efficient Spatial Locality Predictor for GPU's". Nitesh Agarwal.

As the computational power of GPUs grows, their memory hierarchy is increasingly becoming a bottleneck. Current GPU memory accesses are coarse-grained, which means that the entire cache block is fetched on a cache miss. Though this method helps to exploit spatial locality and utilize peak memory bandwidth, coarse-grained memory accesses are a poor match for GPU applications with irregular control flow and memory access patterns. In this thesis, we designed an efficient spatial locality predictor (SLP) for GPUs that dynamically adapts to coarse-grained and fine-grained memory accesses depending on the needs of an application. We implemented the proposed SLP in a cycle-accurate GPU simulator and compared its accuracy and performance to the state-of-the-art bi-modal granularity predictor (BGP). The average accuracy and speedup of SLP are 22% and 10% higher than the state of the art, respectively. Furthermore, the reduction in bandwidth is 40% more than the state of the art.

## 04.09.2018-10:00: "Efficient code generation for ARM SVE". Mirko Greese.

This presentation will provide a brief overview of the Scalable Vector Extension (SVE) for the ARMv8-A AArch64 architecture. To show the potential benefits for compiler auto-vectorization, I compare benchmark results of kernels run with SVE against ARM NEON. The code patterns of the Test Suite for Vectorizing Compilers (TSVC) were compiled with GCC for both vector instruction sets and used to generate the benchmark data.

## 28.08.2018-10:00: "Improving performance by efficient prefetching for GDDR5X". Farzaneh Salehiminapour.

Over the last decade, applications once exclusive to high performance computing are now common in systems ranging from mobile devices to clusters. They typically require large amounts of memory bandwidth. The graphic DRAM interface standard GDDR5X is a new technology that promises almost doubled data rates compared to GDDR5. However, these higher data rates are only possible with a longer burst length of 16 words. This would typically increase the memory access granularity. Moreover, GDDR5X supports an interesting feature called pseudo channel mode. In pseudo channel mode, the memory is split into two 16-bit pseudo channels. This split keeps the memory access granularity constant compared to GDDR5. However, the pseudo channels are not fully independent channels, but access type, bank and page must match. With this restriction, we argue that GDDR5X can best be seen as a GDDR5 memory that allows performing an additional request to the same page without extra cost. Therefore, we aim to develop a DRAM prefetching technique to make effective use of the pseudo channel mode and the additional memory bandwidth offered by GDDR5X.

## 24.07.2018-10:00: "High-performance and energy-efficient stencil computations on ARM processors". Andreas Getzin.

Stencil computations are complex and require many read and write operations on memory. It is therefore desirable to write code in a way that minimizes cache misses. Modern processors are very elaborate in design, which makes optimizing cache usage a challenging task; this is why autotuning has become an increasingly interesting research field in recent years. Autotuning describes a process in which a machine finds the best ways of optimizing loop operations tailored for a specific processor. Different search algorithms with varying results can be used to find the best parameters; some of them include machine learning techniques as well. Previous studies have investigated the usefulness of autotuning on Intel and other processors with varying degrees of success. In this study, the focus is on an ARM processor and how it compares against an Intel processor. Different stencil codes are autotuned on a quad-core ARM Cortex-A53 using a metaheuristic algorithm while the power usage of the processor is recorded via an external measuring device. It is shown that autotuning is actually slightly easier on ARM processors than on Intel and that there is an almost linear negative correlation between performance and energy consumption, which, however, seems to be unaffected by a change in the number of threads. Further research is needed to determine whether a higher number of threads indeed does not affect the behavior of stencil computations on ARM processors, or whether this may be the result of an incorrect implementation of thread count adjustment.
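As an illustration of the kind of loop parameter an autotuner searches over, the sketch below shows a 5-point stencil sweep with and without loop tiling. It is a plain Python illustration and not the stencil codes, the metaheuristic search or the measurement setup of the study; the function names and the choice of tiling as the tuned transformation are assumptions.

```python
def jacobi_step(grid):
    """One 5-point stencil sweep over the interior of a 2D grid."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return out


def jacobi_step_tiled(grid, tile):
    """Same sweep, traversed in tile x tile blocks to improve cache reuse."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for ii in range(1, n - 1, tile):
        for jj in range(1, m - 1, tile):
            # Process one tile; min() clips tiles at the grid boundary.
            for i in range(ii, min(ii + tile, n - 1)):
                for j in range(jj, min(jj + tile, m - 1)):
                    out[i][j] = 0.25 * (
                        grid[i - 1][j] + grid[i + 1][j]
                        + grid[i][j - 1] + grid[i][j + 1]
                    )
    return out


grid = [[float((3 * i + 7 * j) % 11) for j in range(9)] for i in range(8)]
reference = jacobi_step(grid)
tiled_results = [jacobi_step_tiled(grid, tile) for tile in (1, 2, 4, 16)]
```

All tile sizes compute the same result; only the traversal order changes. An autotuner would time each candidate on the target processor and keep the fastest or most energy-efficient one.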

## 17.07.2018-10:00: "NCode Model Implementation and Training Evaluation". Malte Koch.

When transmitting image or video data over wireless channels, the data size is usually reduced by applying lossy compression, which reduces the network payload. A problem with lossy compression algorithms like JPEG/2000 is that they degrade data significantly when the bandwidth of a network drops or transmission errors occur. Generative compression, in the form of the Neural Codec (NCode) and MCode, is a promising image and video compression method that is much more resilient to bit errors. It utilizes Deep Convolutional Generative Adversarial Networks (DCGANs) instead of handcrafted linear transformations for image encoding and decoding. We trained it on eight different datasets. Our analysis shows that NCode does not provide better reconstructions than JPEG/2000, but rather worse ones. Using NCode as a generic compression algorithm is hardly conceivable due to the poor scalability, limited applicability and unpredictable output quality of generative compression. However, 34 times stronger compression and significantly more graceful quality degradation in the case of bit errors compared to traditional methods show the great potential of using Generative Adversarial Networks (GANs) for compression. Furthermore, NCode provides a meaningful output at very high compression rates, while conventional methods only return blurry, unrecognizable results. MCode is able to return more recognizable reconstructions at 19-fold compression than Moving Picture Experts Group (MPEG)-4 (H.264) at 5-fold compression. It can decode and encode at over 30 fps on all tested architectures. Nevertheless, a GPU is generally preferred over the CPU because it is orders of magnitude faster in decoding, encoding and training.

## 10.07.2018-10:00: "Introduction". Robert Drehmel.

On July 12, 2018, I will officially change over to the AES group. In this presentation, I introduce myself and give a quick overview of the research I will be working on.

## 26.06.2018-10:00: "Design and Implementation of a High-Performance GPGPU Video Codec". Rafael Fritsch.

Digital video is an important medium and widely used in communication, information services and entertainment. New developments, such as 4K and 8K resolutions or HDR colour, increase the required data processing rates. Video coding on a CPU is, therefore, becoming less feasible. Whereas dedicated video coding hardware is commonly used, it is limited in its flexibility. GPUs are common in end user devices and offer high data throughput for parallel workloads. They are also versatile, due to the introduction of general purpose computing interfaces. The use of the GPU to accelerate video coding has been investigated before, mainly in the context of existing codecs. However, these codecs were not primarily designed with GPU architectures in mind. In order to make the best use of available resources, we have developed a new video codec, specifically for GPUs. To evaluate its performance and different design choices, the codec was implemented using the CUDA framework. A comparison with state-of-the-art CPU codec implementations shows the advantages of our approach.

## 19.06.2018-10:00: "Application of Machine Learning for Polyhedral Optimizations". Ian Zhang.

The optimization of code is becoming more and more important for improving time and energy efficiency, especially for embedded platforms. Loop optimizations are particularly relevant, because a large share of operations occurs in loops. The polyhedral model is often used for automating loop optimizations, since it can derive a numerical abstraction of those loops, which can then easily be processed by code. Currently, polyhedral loop optimizers tend to be either very slow in selecting optimizations or end up generating sub-optimal solutions. In this thesis, we will implement a method that approaches optimal solutions while still being fast in selection, by using machine learning to approximate correct optimizations.

## 12.06.2018-10:00: "A Model for Low-Energy and High-Performance Frequency Scaling on GPUs". Kaijie Fan.

In modern computer architectures, it is important that applications deliver high performance with low energy consumption. For that reason, many processor designs provide solutions such as dynamic voltage and frequency scaling to tackle power and energy constraints. An example is the Titan X GPU, which allows the programmer to change both memory and core frequencies. However, this task is not straightforward because of the hundreds of possible configurations, and because of the multi-objective nature of the problem, i.e. minimize energy consumption and maximize performance. This paper proposes a method to predict the best GPU core and memory frequency configurations for an input OpenCL kernel. Our modeling approach first analyzes normalized energy and speedup over the default frequency configuration, individually. Then, it combines the two models into a multi-objective one that predicts a Pareto set of frequency configurations. The approach uses static code features, is built on a set of carefully designed micro-benchmarks, and can predict the best frequency settings of a new kernel without executing it.
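To make the multi-objective step concrete, the sketch below extracts a Pareto set from a list of predicted (speedup, normalized energy) pairs. The configuration labels and numbers are invented for illustration, and the function is a generic dominance filter rather than the model described in the abstract.

```python
def pareto_set(configs):
    """Keep configurations not dominated by any other.

    configs: list of (label, speedup, normalized_energy).
    Higher speedup is better, lower normalized energy is better.
    """
    front = []
    for label, speedup, energy in configs:
        dominated = any(
            s >= speedup and e <= energy and (s > speedup or e < energy)
            for _, s, e in configs
        )
        if not dominated:
            front.append((label, speedup, energy))
    return front


# Hypothetical predictions for five frequency configurations (made-up values).
predictions = [
    ("cfg_default", 1.00, 1.00),
    ("cfg_low_core", 0.85, 0.80),
    ("cfg_high_core", 1.10, 1.25),
    ("cfg_low_mem", 0.40, 0.95),
    ("cfg_mid_mem", 0.95, 1.02),
]

for entry in pareto_set(predictions):
    print(entry)
```

Here `cfg_low_mem` and `cfg_mid_mem` are dropped because another configuration is at least as fast and uses no more energy; the remaining entries are the trade-offs a programmer would choose between.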

## 05.06.2018-10:00: "SystemC Modeling of a Memory Paging System for Secure Elements in SoC". Menglong Yuan.

The highest level of security for an embedded device is delivered by Secure Elements (SE). In order to cover the full span of security features for IoT devices built on a SoC platform, the industry responds with a new architecture concept, the Integrated Secure Element (iSE), as a drop-in security solution for IoT platforms. The iSE provides the host SoC platform with security features such as secure boot, secure debug, secure firmware update, rollback protection and many more. The major challenge of the iSE design is that the host SoC platform does not have embedded Non-Volatile Memory (NVM) available for the iSE to store the dedicated OS and the secure applications on chip. The only solution is to build a memory interface for the iSE to operate on the untrusted off-chip memory external to the SoC platform. This thesis presents the architecture of the iSE Paging System, which allows the iSE to access the external NVM and provides the same level of security as the iSE framework. The design is based on previous work on trusted off-chip memory systems that aim at copy protection and data integrity. Beyond these previous designs, the iSE Paging System adds the desired feature of rollback protection through page authentication and version tracking. With these advancements, the Paging System provides the iSE with secure Virtual Memory Operations, which act as a generic memory interface to the external untrusted NVM.
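One common way to combine page authentication with version tracking is to bind a per-page version counter into a message authentication code and keep the latest counter in trusted storage. The sketch below illustrates that general idea only; the class, its fields and the choice of HMAC-SHA256 are assumptions, not the iSE Paging System design, and encryption for copy protection is omitted.

```python
import hashlib
import hmac


class PagedStore:
    """Toy model: trusted version table on chip, pages in untrusted external memory."""

    def __init__(self, key: bytes):
        self._key = key        # secret known only to the secure element
        self._versions = {}    # trusted: page number -> latest version
        self.external = {}     # untrusted: page number -> (version, data, tag)

    def _tag(self, page_no: int, version: int, data: bytes) -> bytes:
        msg = page_no.to_bytes(4, "big") + version.to_bytes(8, "big") + data
        return hmac.new(self._key, msg, hashlib.sha256).digest()

    def write_page(self, page_no: int, data: bytes) -> None:
        version = self._versions.get(page_no, 0) + 1
        self._versions[page_no] = version
        self.external[page_no] = (version, data, self._tag(page_no, version, data))

    def read_page(self, page_no: int) -> bytes:
        version, data, tag = self.external[page_no]
        # Page authentication: detects modified data or a forged version field.
        if not hmac.compare_digest(tag, self._tag(page_no, version, data)):
            raise ValueError("page authentication failed")
        # Version tracking: detects replay of an old but correctly tagged page.
        if version != self._versions.get(page_no):
            raise ValueError("stale page version (rollback detected)")
        return data
```

An attacker who restores an older copy of a page still presents a valid tag, so authentication alone would accept it; only the comparison against the trusted version counter rejects the replay.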

## 29.05.2018-10:00: "Performance Counters Based GPU Power Modeling using Machine Learning". Markus Neu.

GPUs have become important computational units on mobile and embedded devices. As power consumption is considered a critical factor on these devices, it is becoming increasingly important to be able to estimate and understand how GPU applications behave in terms of power. In this work, we introduce a machine learning power model for a mobile GPU using hardware performance counters and real power measurements. As part of this mid-term presentation, I will focus on the design of the power model, starting with the basic principles of neural networks (NNs) and their construction. The first step towards the model is the evaluation and processing of a given example data set. A basic model was then built with the Keras framework and used to obtain first results. From this starting point, the model was improved, guided by theory, to achieve the presented results, though the quality of the data has a major impact on the resulting model capability.

## 22.05.2018-10:00: "Control Flow Vectorization for ARM NEON". Angela Pohl.

Single Instruction Multiple Data (SIMD) extensions in processors enable in-core parallelism for operations on vectors of data. From the compiler perspective, SIMD instructions require automatic techniques to determine how and when it is possible to express computations in terms of vector operations. This work analyzes the challenge of generating efficient vector instructions by benchmarking 151 loop patterns with three compilers on two SIMD instruction sets. Comparing the vectorization rates for the AVX2 and NEON instruction sets, we observe that the presence of control flow poses a major problem for the vectorization on NEON. We consequently propose a set of solutions to generate efficient vector instructions in the presence of control flow. Results show that we enable vectorization of conditional read operations with a minimal overhead, while our technique of atomic select stores achieves a speedup of more than 2x over state of the art for large vectorization factors.

## 03.05.2018-10:00: "Design and Implementation of a high-performance GPGPU Video Codec". Rafael Fritsch.

Digital video is an important medium and widely used in communication, information services and entertainment. New developments, such as 4K and 8K resolutions or HDR colours, increase the required data processing rates. Video coding on a CPU is therefore becoming less feasible. Whereas dedicated video coding hardware is commonly used, it is limited in its flexibility towards new developments. Moreover, it is unavailable for other uses.
GPUs on the other hand offer high data throughput and are highly versatile, due to the introduction of general purpose computing interfaces. We have designed a new video codec that fully exploits available GPU resources to achieve high coding performance. The codec was implemented in the CUDA framework for Nvidia GPUs. Different design choices have been evaluated and a comparison to state-of-the-art codec implementations for CPUs will be made.

## 26.04.2018-10:00: "OpenABL: A Domain-Specific Language for Parallel and Distributed Agent-Based Simulations". Nikita Popov.

Agent-based simulations are becoming widespread among scientists from different areas, who use them to model increasingly complex problems. To cope with the growing computational complexity, parallel and distributed implementations have been developed for a wide range of platforms. However, it is difficult to have simulations that are portable to different platforms while still achieving high performance. We present OpenABL, a domain-specific language for portable, high-performance, parallel agent modeling. It comprises an easy-to-program language that relies on high-level abstractions for programmability and explicitly exploits agent parallelism to deliver high performance. A source-to-source compiler translates the input code to a high-level intermediate representation exposing parallelism, locality and synchronization, and, thanks to an architecture based on pluggable backends, generates target code for multi-core CPUs, GPUs, large clusters and cloud systems. OpenABL has been evaluated on six applications from various fields such as ecology, animation, and social sciences. The generated code scales to large clusters and performs similarly to hand-written target-specific code, while requiring significantly fewer lines of code.

## 19.04.2018-10:00: "A Study of Vectorization on ARM". Nicolás Morini.

The ability to vectorize different code patterns and generate quality code is of utmost importance for vectorizing compilers in order to take advantage of the SIMD extensions present in modern processors. This study explored how well GCC and LLVM support auto-vectorization on an ARM Cortex-A53 processor compared to an Intel Kaby Lake processor.
By using a synthetic benchmark, it was found that only between 42% and 50% of the test loop patterns were vectorized. Furthermore, each compiler vectorized a different subset of the loops for each of the hardware platforms. When considering the reasons why vectorization failed, it was observed that out of 14 loops vectorized by LLVM for the Intel processor, but not for ARM, 13 are due to the presence of control flow. This is rooted in the fact that the NEON vector extensions in the Cortex-A53 do not provide masked-load and masked-store vector instructions, while the AVX extensions for Intel do.
To overcome the lack of a masked-load instruction, the use of a pragma statement to hint to the compiler that hoisting the load is safe is suggested. Next, after identifying and fixing an implementation defect in LLVM’s source code, the use of scalar predicated stores was enabled as an alternative to masked stores. Furthermore, two new techniques using vector select instructions are proposed and implemented in LLVM: select-store (ST) for single-threaded applications, and atomic-select-store (AST) for multi-threaded applications. With the proposed techniques, the number of loop patterns vectorized by LLVM for ARM equals that for Intel. Lastly, the main factors that affect the profitability of using the proposed techniques were identified and benchmarked. It was found that vectorization is always profitable for 8- and 16-bit data types, while for 32-bit data types the profitability highly depends on the kernel being vectorized and the branching ratio of the conditional block. 64-bit data types always resulted in slowdowns.
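The idea behind a select-based store can be emulated in plain Python: instead of storing only the lanes where the condition holds, every lane is loaded, a select picks the new or the old value per lane, and the whole vector is stored back. This is a simplified illustration with hypothetical function names and a made-up loop body, not the LLVM implementation.

```python
def scalar_conditional_store(a, b, threshold):
    """Reference loop with control flow: store only where the condition holds."""
    for i in range(len(a)):
        if b[i] > threshold:
            a[i] = b[i] * 2


def select_store(a, b, threshold, width=4):
    """Emulated vector version using a select instead of a masked store."""
    assert len(a) % width == 0, "sketch assumes no remainder loop"
    for base in range(0, len(a), width):
        vec_b = b[base:base + width]            # vector load of the input
        old = a[base:base + width]              # vector load of the destination
        mask = [x > threshold for x in vec_b]   # vector compare
        new = [x * 2 for x in vec_b]            # computed for all lanes
        # Select per lane, then store the full vector unconditionally.
        a[base:base + width] = [n if m else o for m, n, o in zip(mask, new, old)]


b = [1, 5, 3, 8, 2, 9, 4, 7]
expected = [0] * 8
actual = [0] * 8
scalar_conditional_store(expected, b, threshold=4)
select_store(actual, b, threshold=4)
print(expected == actual)
```

Because inactive lanes are written back with their previously loaded values, another thread that updates those elements between the load and the store could have its update overwritten, which is consistent with the abstract reserving plain select-store for single-threaded code and proposing an atomic variant for multi-threaded code.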

## 12.04.2018-10:00: "SIMD Extension for the RISC-V ISA". Hannes Kadlik.

A development team from Berkeley created the new instruction set architecture RISC-V, which, compared to other instruction set architectures, follows a new concept in the use of instructions. A base instruction set is available that can be supplemented with extensions. These extensions make floating-point, multiplication and division, or SIMD instructions available.
To answer the question of how the SIMD extension compares to other SIMD extensions, their assembly code for a loop executing an arithmetic function was evaluated. The results show that the RISC-V extension can compute more than 20 times as many results per instruction for one iteration. However, the example of ARM also shows that memory access time increases with growing amounts of data. This will reduce the achievable speedup of the RISC-V extension.

## 05.04.2018-10:00: "OpenMP Accelerator Offloading with OpenCL using SPIR-V". Daniel Schürmann.

For many applications, modern GPUs could potentially offer high efficiency and performance. However, due to the requirement to use specialized languages like CUDA or OpenCL, it is complex and error-prone to convert existing applications to target GPUs. OpenMP is a well-known API to ease parallel programming in C, C++ and Fortran, mainly by using compiler directives. With the introduction of new features in OpenMP 4.0 to support offloading to accelerator devices, OpenMP became a potential programming model to utilize GPUs and easily convert existing code.
In this work, we design and implement an extension for the Clang compiler and a runtime to offload OpenMP programs onto GPUs using a SPIR-V enabled OpenCL driver. The OpenMP execution model was successfully mapped to the OpenCL execution model, providing an efficient technique to parallelize applications while maintaining portability. According to our experimental results, we were able to obtain a speed-up of 327.8x for a simple kernel and a speed-up of 1.9x for a real-world benchmark application compared to serial code.

## 29.03.2018-10:00: "Design and Implementation of an Efficient Spatial Locality Predictor for GPU's". Nitesh Agarwal.

As the computational power of GPUs grows, their memory hierarchy is increasingly becoming a bottleneck. Current GPU memory accesses are coarse-grained, which means that the entire cache block is fetched on a cache miss. Though this method helps to exploit spatial locality and utilise peak memory bandwidth, coarse-grained memory accesses are a poor match for GPU applications with irregular control flow and memory access patterns. Thus, the goal of this thesis is to design an efficient spatial locality predictor (SLP) for GPUs that can dynamically adapt between coarse-grained and fine-grained memory accesses depending on the needs of the application. This would save memory bandwidth and significantly improve the performance of GPU applications. In the mid-term presentation, I will present the initial results for predictor accuracy compared to the state of the art.

## 15.03.2018-10:00: "SystemC Modeling of a Memory Paging System for Integrated Secure Elements in SoC". Menglong Yuan.

The Integrated Secure Element (iSE) is the next generation security solution designed specifically for IoT products with high requirements on security and privacy. iSE integrates the state-of-the-art secure element technology into the system on chip (SoC) itself, bringing more security features and a much smaller footprint. The major challenge of the iSE is the fact that the target SoC platform has no non-volatile memory (NVM) to persistently store the protected data. Therefore, a memory paging system is needed to handle and store the protected data on an untrusted off-chip memory. This thesis evaluates the iSE memory paging architecture using SystemC modeling. The goal is to perform runtime analysis of the data integrity, as well as to compare the performance with the baseline secure element.

## 08.03.2018-10:00: "Optimal DC/AC Data Bus Inversion Coding". Jan Lucas.

GDDR5 and DDR4 memories use data bus inversion (DBI) coding to reduce termination power and decrease the number of output transitions. Two main strategies exist for encoding data using DBI: DBI DC minimizes the number of outputs transmitting a zero, while DBI AC minimizes the number of signal transitions. We show that neither of these strategies is optimal and that a reduction of write termination power of up to 6% can be achieved by taking both the number of zeros and the number of signal transitions into account when encoding the data. We then demonstrate that a hardware implementation of optimal DBI coding is feasible and requires only an insignificant amount of additional die area. This work has been accepted at DATE 2018, and this talk will serve as a rehearsal to gather feedback for the conference presentation.
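The difference between the three strategies can be sketched for a single byte lane. The model below makes several assumptions that are not taken from the paper: a 9-wire lane (8 data wires plus one DBI wire that is 0 when the data is inverted), an idle bus of all ones, a cost that is a weighted sum of zeros and transitions with illustrative weights, and a small dynamic program over the invert decisions as one possible way to find the minimum-cost encoding.

```python
def popcount(x):
    return bin(x).count("1")


def encode(byte, invert):
    """9-bit bus word; bit 8 is the DBI wire (assumed 0 when data is inverted)."""
    return (byte ^ 0xFF) if invert else (byte | 0x100)


def decode(word):
    return (word & 0xFF) if word & 0x100 else (word & 0xFF) ^ 0xFF


def bus_cost(words, w_zero=1, w_trans=1, idle=0x1FF):
    """Weighted count of zeros driven and wire transitions over a burst."""
    total, prev = 0, idle
    for word in words:
        total += w_zero * (9 - popcount(word)) + w_trans * popcount(prev ^ word)
        prev = word
    return total


def dbi_dc(data):
    """Invert a byte when more than four of its data bits are zero."""
    return [encode(b, (8 - popcount(b)) > 4) for b in data]


def dbi_ac(data, idle=0x1FF):
    """Invert a byte when more than four data wires would toggle."""
    words, prev = [], idle
    for b in data:
        word = encode(b, popcount((prev & 0xFF) ^ b) > 4)
        words.append(word)
        prev = word
    return words


def dbi_min_cost(data, w_zero=1, w_trans=1, idle=0x1FF):
    """Dynamic program: best (cost, words) ending in each invert choice."""
    states = [(0, [])]
    for b in data:
        next_states = []
        for invert in (False, True):
            word = encode(b, invert)
            candidates = []
            for cost, words in states:
                prev = words[-1] if words else idle
                step = w_zero * (9 - popcount(word)) + w_trans * popcount(prev ^ word)
                candidates.append((cost + step, words + [word]))
            next_states.append(min(candidates, key=lambda c: c[0]))
        states = next_states
    return min(states, key=lambda c: c[0])[1]


burst = [0x00, 0xFF, 0x0F, 0xF0, 0x55, 0xAA, 0x01, 0xFE]
for name, scheme in (("DC", dbi_dc), ("AC", dbi_ac), ("min-cost", dbi_min_cost)):
    print(name, bus_cost(scheme(burst)))
```

Since DBI DC and DBI AC each correspond to one particular sequence of invert decisions, the dynamic program can never do worse than either under the same cost model; how much it gains depends on the data and on the chosen weights.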

## 22.02.2018-10:00: "Local Memory-Aware Kernel Perforation". Daniel Maier.

Many applications provide inherent resilience to some amount of error and can potentially trade accuracy for performance by using approximate computing. Applications running on GPUs often use local memory to minimize the number of global memory accesses and to speed up execution. Local memory can also be very useful to improve the way approximate computation is performed, e.g., by improving the quality of approximation with data reconstruction techniques. This paper introduces local memory-aware perforation techniques specifically designed for the acceleration and approximation of GPU kernels. We propose a local memory-aware kernel perforation technique that first skips the loading of parts of the input data from global memory, and later uses reconstruction techniques on local memory to reach higher accuracy while having performance similar to state-of-the-art techniques. Experiments show that our approach is able to accelerate the execution of a variety of applications from 1.6× to 3× while introducing an average error of 6%, which is much smaller than that of other approaches. Results further show how much the error depends on the input data and application scenario, the impact of local memory tuning and different parameter configurations.
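The two-phase idea (skip part of the input when loading, then reconstruct it before computing) can be illustrated on the CPU with a one-dimensional example. Everything specific here is an assumption made for illustration: the perforation pattern (every other element), the linear-interpolation reconstruction, the 3-point mean kernel and the function names. It is not the GPU implementation from the paper.

```python
def mean3(values):
    """3-point mean filter with clamped borders (the 'kernel')."""
    n = len(values)
    return [(values[max(i - 1, 0)] + values[i] + values[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]


def perforated_mean3(signal):
    """Load only even-indexed inputs, reconstruct the rest, then run the kernel.

    Returns (result, number_of_loads_from_the_input).
    """
    n = len(signal)
    local = [0.0] * n          # stands in for local memory
    loads = 0
    for i in range(0, n, 2):   # perforated load: skip odd indices
        local[i] = signal[i]
        loads += 1
    for i in range(1, n, 2):   # reconstruction from neighbours already in local memory
        right = local[i + 1] if i + 1 < n else local[i - 1]
        local[i] = 0.5 * (local[i - 1] + right)
    return mean3(local), loads


signal = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]
exact = mean3(signal)
approx, loads = perforated_mean3(signal)
error = max(abs(x - y) for x, y in zip(exact, approx))
print(loads, "loads instead of", len(signal), "max abs error", error)
```

The error depends on how well the reconstruction matches the data: a linear ramp is reconstructed exactly by interpolation, while rapidly varying inputs lose more accuracy.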

## 15.02.2018-10:00: "A Bitstream Partitioning Approach for high-throughput CABAC Decoding in Next Generation Video Coding". Philipp Habermann.

High Efficiency Video Coding (HEVC/H.265) is today's state-of-the-art video coding standard. Compared to its predecessor H.264/AVC, it enables video compression at half the bitrate while maintaining the same video quality. Context-based Adaptive Binary Arithmetic Coding (CABAC) is the main throughput bottleneck for high quality video decoding. High-level parallelization tools, such as Tiles and Wavefront Parallel Processing, have been specified in the HEVC standard. However, they require the replication of the full CABAC decoding hardware. We propose a Bin-based Bitstream Partitioning (B3P) scheme for CABAC hardware decoding to address these issues. Parallel processing is enabled by distributing bins among eight partitions. Significant speed-ups of up to 8.5x can be achieved while only 9.2% extra hardware is required. The B3P hardware decoder can process up to 3.94 Gbins/s and the bitstream overhead is also negligible for high bitrates. Compared to the related Syntax Element Partitioning, we achieve higher throughput at similar hardware cost and coding efficiency.

## 31.01.2018-10:00: "Control Flow Vectorization for ARM NEON". Angela Pohl.

Single Instruction Multiple Data (SIMD) extensions in processors enable in-core parallelism for operations on vectors of data. From the compiler perspective, SIMD instructions require automatic techniques to determine how and when it is possible to express computations in terms of vector operations. This work analyzes the challenge of generating efficient vector instructions by benchmarking 151 loop patterns with three compilers on two SIMD instruction sets. Comparing the vectorization rates for the AVX2 and NEON instruction sets, we observe that the presence of control flow poses a major problem for the vectorization on NEON. We consequently propose a set of solutions to generate efficient vector instructions in the presence of control flow. Results show that we enable vectorization of conditional read operations with a minimal overhead, while our technique of atomic select stores achieves a speedup of more than 2x over state of the art for large vectorization factors.

## 24.01.2018-10:00: "Towards Cross-Platform GPGPU: Enabling Vulkan to Gain Prominence over CUDA & OpenCL". Nadjib Mammeri.

The recent introduction of the Vulkan API along with SPIR-V provides a new programming model for writing and tuning GPU applications. Although Vulkan's main focus is on graphics, it also supports compute, offers fine-grained control over modern GPUs and promises to be portable across different architectures. In this paper, we examine Vulkan from the compute perspective and show that it can be regarded as a new GPGPU programming model, notably on mobile devices. However, GPGPU programmers might be deterred from adopting Vulkan because it is low-level and requires a higher programming effort. To mitigate this, we propose VComputeLib, a runtime library that adds an abstraction layer on top of the low-level API, lowering Vulkan's programming effort. VComputeLib employs further optimization techniques, namely device queue virtualization and granular memory management, enabling developers to write efficient platform-agnostic applications. We also extend the Rodinia benchmark suite by developing a set of Vulkan compute benchmarks using our proposed VComputeLib runtime. We conduct a thorough performance comparison of Vulkan versus CUDA and OpenCL on different hardware platforms, utilizing our developed benchmarks along with Rodinia's CUDA and OpenCL implementations. Our results show that Vulkan offers better performance, with average speedups of 1.53x and 1.66x compared to CUDA and OpenCL respectively and up to 8x speedup for certain benchmarks. On mobile platforms, Vulkan offers a 1.59x average speedup compared to OpenCL. We also show that the programmability of Vulkan can be improved substantially with the help of runtime libraries such as VComputeLib, lowering the barrier to its adoption as the framework of choice for writing cross-platform GPGPU applications.

## 17.01.2018-10:00: "Accurate Speedup Prediction in Auto-Vectorizers". Angela Pohl.

Compiler optimization passes use cost modeling to understand whether a code transformation yields an improvement compared to its baseline version. If this assessment is not accurate, a compiler might apply transformations that are not beneficial, or refrain from applying ones that would have improved the code. In this presentation, I will discuss my analysis of the profitability prediction in LLVM’s and GCC’s vectorization passes. Since the key result is a low correlation between predicted and measured speedup, I will present a technique to tailor cost models to target hardware platforms and show how the correlation and subsequently the performance prediction can be improved for real-world codes.

## 19.12.2017-10:00: "NCode Model Implementation and Training Evaluation". Malte Koch.

When transmitting image/video data over wireless channels, the data size is usually reduced by applying lossy compression, which reduces the network payload. A problem with lossy compression algorithms like JPEG/2000 is that they degrade data significantly when the bandwidth of a network drops or transmission errors occur. The neural codec architecture NCode is a promising image and video compression method that is much more resilient to bit errors. It utilizes Deep Convolutional Generative Adversarial Networks (DCGANs) instead of handcrafted linear transformations for image encoding/decoding. We trained NCode on nine different datasets. While NCode yields astonishing results on the KTH hand-waving dataset, it fails badly on the LSUN classrooms dataset. Therefore, the NCode compression method cannot be used as a generic compression algorithm. NCode can decode/encode at 2065/8000 fps on an NVIDIA GTX 1080 Ti GPU and at 63/1560 fps on an Intel 7700K CPU. It is possible to reach 6-fold higher compression levels than JPEG/2000 with higher quality reconstructions.

## 29.11.2017-10:30: "An Evaluation of a High-Level Synthesis Tool for FPGAs". Jonas Tröger.

High-Level Synthesis (HLS) enables the design of hardware without the use of Hardware Description Languages (HDLs) and promises a faster and more efficient design process compared to conventional design processes using HDLs. The Intel FPGA SDK for OpenCL provides an HLS tool, which compiles OpenCL source code for Intel FPGAs, and an OpenCL runtime environment. The HLS tool uses a special approach for generating the hardware structure, which differs from the traditional FSMD approach and uses specific features of the OpenCL programming language. To evaluate the capabilities of the tool, the general hardware generation approach was analyzed and the results of the tool were compared to VHDL implementations. Five different OpenCL kernels were selected, which focus on different challenges of hardware generation. For each kernel, a base implementation and multiple manually optimized OpenCL implementations were created. The VHDL implementation provides exactly the same functionality and uses the same hardware interface and synthesis flow as the OpenCL implementations. The implementations were synthesized and executed on a development board or in a simulation environment, which was developed for the comparison. The performance and the resource utilization of the implementations were compared and possible reasons for the disparities were identified. The comparison shows that the disparity between OpenCL and VHDL implementations grows with the complexity of memory accesses and control flow. While the OpenCL and VHDL implementations perform similarly for simple kernels, the optimized OpenCL implementation is worse than the VHDL implementation by a factor of 4 in terms of performance and resource utilization for selected complex kernels.

## 29.11.2017-10:00: "DRAM Prefetching Techniques on GDDR5X". Farzaneh Salehiminapour.

In this talk, I will briefly introduce my background and research interests. Thereafter, I will present my current research plan, which mainly focuses on GDDR5X, a new graphics memory standard. Based on the features reported in the GDDR5X standard, we will improve the performance of the existing memory hierarchy using prefetching techniques.

## 22.11.2017-10:00: "Minimizing Interference between Multiple Time-of-Flight Cameras". Siby Thachil.

Time-of-Flight (ToF) cameras provide real-time depth information along with a color image at a frame rate of 30 fps. ToF cameras illuminate the scene with an active light source, and a built-in photodetector captures the reflected light. Depth data is extracted by calculating the total time of flight of the emitted light pulses. When two or more ToF cameras are present in a scene, mutual interference may be significant. Due to limited processing power and memory, manufacturers deploy simple solutions such as a hardware counter to minimize the overlap between illumination periods. To reduce the mutual interference, three schemes are explored: randomization of illumination, frequency division multiple access (FDMA) and round-robin illumination. The first approach randomizes illumination by applying pseudo-random codes. FDMA operation means running different cameras at different base frequencies. In the round-robin scheme, a master controller synchronizes the illumination and eliminates mutual interference. The three schemes are implemented in Fotonic G-series ToF cameras and evaluated in the same test setup. In a typical use case, randomizing illumination with GPS Coarse/Acquisition pseudo-random codes is more robust than the FDMA scheme. The three schemes are compared and the results are presented.

## 15.11.2017-10:00: "Limits of Instruction-Level Parallelism 2017". Susheel Puranik.

Computer-systems performance analysis is more of an art than a science. Researchers often reach different conclusions when analyzing the same system. These disagreements often lead to scientific communications and ultimately to clarity on the research topic. Authors of well-established books, in turn, use these scientific communications to teach various concepts of embedded systems engineering to aspiring researchers. "Computer Architecture - A Quantitative Approach" by John Hennessy and David Patterson (H&P book) is one such well-established book. Though the conceptual findings in the book are as firm as rock, the scientific research supporting the findings is becoming obsolete. Even the discussions of very basic parallelism techniques like Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) have not been updated with newer results and experiments using modern computer performance analysis. In this thesis, we focus on the limits of Instruction-Level Parallelism, a topic taken from the 1991 and 1993 research work of Dr. David W. Wall (Wall's experiment). This experiment is well researched but has been outdated for almost 25 years, and it is still being used to teach the fundamentals of ILP. In this thesis, a trace-driven simulator replicates Wall's experiment by using a modern GEM5 simulator model along with workloads from the SPEC CPU1995 and SPEC CPU2006 benchmark suites. A detailed study on the verification of the GEM5 simulator model has also been conducted with the help of micro-benchmarks. Compared to Wall's experiment, this thesis reports that the branch prediction in modern compilers has increased the limits of ILP significantly. With better branch predictors, better IPC has been obtained. Also, the unique nature of individual floating-point and integer benchmarks has provided interesting insights for various performance tests.
Wall's experiment has shown us that pushing ILP much further would be extremely difficult, but the results from this thesis show that, after 25 years, these limits are being pushed higher. The results of this thesis show a remarkable interplay between trends in compilation technology and advances in computer architecture.

## 07.11.2017-10:30: "Detection and classification of heartbeats from raw ECG data using small neural networks". Andrés Gunnarsson.

This presentation explores and discusses the possibilities of using a very small, low-power neural network to detect and classify heartbeats from raw ECG data. Instead of pre-processing the raw data to extract features and then training the network on those features, this work aims to enable the network to classify the data straight from the raw input ECG. The dataset for training and testing is the MIT-BIH arrhythmia database. A comparison will be made to other methods using the same dataset.

## 07.11.2017-10:00: "Evaluating OpenCL performance and energy consumption on different parallel architectures". Kaijie Fan.

In this talk, I will first give a brief introduction of myself, my background, and my initial research plan. Then, I will briefly introduce my current work on OpenCL. My goal is to analyze the correlation between execution time and energy consumption on a variety of OpenCL-capable devices. For energy consumption measurements, I am using Intel's Running Average Power Limit (RAPL) interface and the NVIDIA Management Library (NVML). Building on this analysis, my future work will focus on improving existing programming models and modeling for both higher energy efficiency and performance.

## 01.11.2017-10:30: "OpenMP Accelerator Offloading using OpenCL with SPIR-V". Daniel Schuermann.

For many applications, modern GPUs could offer high efficiency and performance. However, due to the requirement to use specialized languages like CUDA or OpenCL, converting existing applications to target GPUs is complex and error-prone. OpenMP is a well-known API that supports parallel programming in C, C++ and Fortran, mainly through compiler directives. With the introduction of new features in OpenMP 4.0 to support devices with offloading capabilities, OpenMP became a potential programming model for utilizing GPUs and easily converting existing code. In this work, we extend Clang and implement a run-time to offload OpenMP programs onto GPUs using a SPIR-V enabled OpenCL run-time.

## 01.11.2017-10:00: "Development of a RICH particle identification algorithm on Intel's Knight's Landing Platform". Christina Quast.

At CERN, particles are collided in order to understand how the universe was created. In the LHCb experiment, proton-proton collisions are used to investigate the matter-antimatter asymmetry of the universe. A non-optimized C++ implementation of the pattern recognition algorithm for the current CPU hardware already existed. For my master's thesis, the algorithm was ported to and optimized for Intel's Xeon Phi Knights Landing (KNL), a platform designed for HPC (High-Performance Computing). The memory accesses and the data representation in memory were studied in detail to determine the best optimization techniques to apply to the RICH pattern detection algorithm used in the LHCb experiment. The hotspots of the code were identified and optimized first, because they had the biggest influence on performance. Parallelization and vectorization were applied in order to make use of all the resources provided by the Knights Landing processor.

## 24.10.2017-10:00: "A Generic Userspace Hardware Abstraction Layer for Linux SoCs". Uffke Drechsler

Co-designs for FPGA-SoCs demand OS drivers for data exchange between software and hardware. These drivers are often neglected (especially while developing co-designs) and end up sloppily written or even broken. User-space drivers are an alternative way of interfacing with FPGA designs, but Linux does not support unprivileged ones. An intermediate layer (hwfs) was created that is configured and accessed from user space to enable such drivers. Hwfs lets unprivileged programs interface with the hardware via file system operations and applies strict permission sets, making it suitable for development, testing, debugging, and even deployment.

## 18.10.2017-10:00: "Detection and classification of heartbeats from raw ECG data using small neural networks". Andrés Gunnarsson

This presentation explores and discusses the possibilities of using a very small, low-power neural network to detect and classify heartbeats from raw ECG data. Instead of pre-processing the raw data to extract features and then training the network on those features, this work aims to enable the network to classify the data straight from the raw input ECG. The dataset for training and testing is the MIT-BIH arrhythmia database. A comparison will be made to other methods using the same dataset.

## 10.10.2017-10:00: "Deep-Learning on-chip - Impressions from a summer school". Matthias Gobel

I recently visited a summer school on the topic of "Deep Learning on-chip". In this talk, I'm going to give you a brief overview of the contents that were presented. In particular, there were five different lectures:

• Deep Learning basics: from vectors to sequences
• Plenty of Room at the Bottom? Micropower Deep Learning for Cognitive Cyberphysical Systems
• Emerging Technologies for Neuromorphic Computing
• Convolution filters with High Level Synthesis tools
• Accelerating Deep Neural Networks on FPGAs

I'm going to discuss some of these topics and give the main conclusions and insights that I gained at the summer school. If you are interested in any of these topics, I can share the original slides with you for further study.

## 04.10.2017-10:30: "Design and Implementation of a Configurable Convolution Filter for Deeplearning". Lars Schymik.

Convolutional Neural Networks (CNNs), a class of deep neural networks, are becoming more and more important for live pattern recognition in abstract data, especially for speech recognition (e.g. in mobile phones) and image recognition (e.g. of objects and faces in automotive systems such as Tesla's Autopilot, or in smart home environments). In image processing, deep nets aim to find a set of locally connected neurons that form so-called representative features of the object to be recognized, whereas classic neural nets just aim to learn a full weight matrix between two connected layers. Feature recognition and learning in CNNs is done using convolution filters whose kernels are not fixed but data-specific. Convolutional filters are computationally expensive, so the next technological step after using GPU power through OpenCL or CUDA is to parallelize filter operations as much as possible to reduce both learning and classification time. In this presentation, I explain my FPGA design of a parallel, configurable convolutional filter for a CNN on a Xilinx ZYNQ SoC, developed during my bachelor thesis, evaluate it, and give an outlook on possible optimizations and limits.
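
As a software reference for the filter operation described above, the following minimal sketch slides a kernel over a single-channel image. The function name, the nested-list image layout, and the example kernel are assumptions made for illustration; this is not the FPGA design from the talk, which would parallelize the multiply-accumulate operations in hardware.

```python
def conv2d(image, kernel):
    """Valid-mode 2D filtering (cross-correlation, the form commonly used in CNNs).

    image and kernel are lists of rows; the output shrinks by (kernel size - 1).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # one output pixel = sum of kh*kw multiply-accumulate operations
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# Hypothetical 3x3 image and 2x2 diagonal-difference kernel
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = conv2d(image, kernel)
```

Every output pixel is independent of the others, which is what makes the operation a natural candidate for parallel hardware.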

## 04.10.2017-10:00: "Quantitative Analysis of Advanced Computer Architecture Techniques: Dynamic Exploitation of ILP". Wenbo Li.

The book Computer Architecture: A Quantitative Approach is well written, but it can still be improved. For example, Chapter 3 of the book mainly discusses instruction-level parallelism and its exploitation, covering branch prediction and several dynamic scheduling techniques. However, for the branch prediction part, the book uses results from the old SPEC89 benchmarks, which need to be updated using recent benchmarks. Furthermore, when the book introduces dynamic scheduling techniques such as out-of-order execution, speculation, and superscalar execution, these techniques are only demonstrated using paper-and-pencil examples; there are no experimental results to support the theory. Experiments need to be designed to motivate those techniques quantitatively. For the branch prediction part, the author evaluates the prediction accuracy on the SPEC2006 benchmarks and compares it with the SPEC89 results to see what has changed during the last 20 years. For the dynamic exploitation techniques, the author designs several experiments to demonstrate the performance benefit of those techniques.
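
To illustrate what measuring prediction accuracy involves, here is a minimal sketch of a 2-bit saturating-counter predictor run over a synthetic branch trace. The table size, the initial counter state, the branch address, and the trace itself are illustrative assumptions, not data or code from the thesis.

```python
def predict_accuracy(trace, table_bits=4):
    """Return the fraction of branches predicted correctly.

    trace: iterable of (pc, taken) pairs, where taken is a bool.
    Each table entry is a 2-bit counter: 0-1 predict not taken, 2-3 predict taken.
    """
    size = 1 << table_bits
    counters = [1] * size              # assumed initial state: weakly not taken
    correct = 0
    total = 0
    for pc, taken in trace:
        idx = pc % size                # index the table with the low PC bits
        prediction = counters[idx] >= 2
        correct += prediction == taken
        total += 1
        # saturating update toward the actual outcome
        if taken:
            counters[idx] = min(3, counters[idx] + 1)
        else:
            counters[idx] = max(0, counters[idx] - 1)
    return correct / total if total else 0.0


# Hypothetical loop branch at PC 0x40: taken nine times, then not taken, three times over
loop_trace = [(0x40, i % 10 != 9) for i in range(30)]
accuracy = predict_accuracy(loop_trace)
```

On this trace the predictor misses the first iteration and each loop exit, so 26 of the 30 branches are predicted correctly.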

## 26.09.2017-10:00: "A Reconfigurable Architecture for Real-Time Image Compression On-Board Satellites". Kristian Manthey.

Data products from optical remote sensing systems are increasingly used in our daily lives. Both the spatial and the spectral resolution of satellite image data rise steadily with new missions, which leads to higher precision of known methods as well as to new application scenarios. While the requirements on storage capacity can still be met, transmission capacity is becoming increasingly problematic. Real-time transmission of high-resolution image data is currently not possible. This work presents a new architecture for image data compression that can be used for current and future projects at the German Aerospace Center (DLR). The architecture supports so-called regions of interest and enables flexible access to image data compressed with the CCSDS 122.0-B-1 standard. Region-of-interest coding can be useful in applications that employ on-board classification or registration as well as object or change detection algorithms. It is also useful for reducing the amount of data that must be transmitted to the ground station. Modifications to the standard were made to allow the compression parameters to be changed and the bitstream to be reorganized after compression. In addition, an index of the compressed data is created, which makes it possible to locate individual parts of the bitstream. On request, the compressed and stored images can be assembled and transmitted according to the requirements of the application and as requested by the ground station. Requirements, the design of an architecture, and its implementation on reconfigurable hardware are presented.
The architecture is developed for a Xilinx Virtex-5QV, where a single instance of the architecture is able to compress images at a rate of up to 200 Mpx/s. It operates at a clock frequency of 100 MHz and processes two pixels per clock cycle. A Xilinx Virtex-5QV allows the compression of images with a width of up to 4096 pixels without the use of external memory. Without external memory and additional interfaces, the power consumption of the architecture is about 4 W. The presented architecture is one of the fastest implementations to date and is also suitable for current high-resolution systems. Investigations of resource and energy consumption, as well as of the availability of external memory, have shown that it should be possible to integrate the design directly on a focal plane.

## 19.09.2017-10:30: "Energy Optimized CPU+GPU HEVC decoding using GPU frequency scaling && GPU power benchmarking". Biao Wang.

NVIDIA GPUs by default use dynamic voltage and frequency scaling to achieve better energy efficiency. For CPU+GPU HEVC decoding, however, this strategy sometimes leads to low energy efficiency. In this talk, we show how to increase the energy efficiency of HEVC decoding on CPU+GPU systems using GPU frequency scaling. In the second part of this presentation, we will present several micro-benchmarks for a GPU power model based on linear regression. The goal of developing more distinct micro-benchmarks is to provide more training data for the power model. In addition, a divide-and-conquer method for improving the power model will be presented.

## 19.09.2017-10:00: "A study of Vectorization on ARM". Nicolas Mariano Morini.

In order to take advantage of SIMD extensions on modern processors, programmers rely on auto-vectorizing compilers. Although in many cases compilers do a good job when it comes to vectorization, they have been shown to have only a 40-60% overall effectiveness at vectorizing simple code patterns on x86 platforms. Given the recent popularity of ARM processors in the low-power mobile market, the question “How do they compare against x86 processors regarding vectorization?” arises. Hence, a study of the current performance of modern vectorizing compilers on ARM processors compared with x86 processors is proposed.

## 13.09.2017-10:00: "An Energy-Aware Prediction Methodology for Achievable Bandwidth of Heterogeneous Memory Architectures on FPGA-SoCs". Matthias Goebel.

The trend of using heterogeneous computing and HW/SW-Codesign approaches allows increasing performance significantly while reducing power consumption. One of the main challenges when combining multiple processing devices is the communication, as an inefficient communication configuration can pose a bottleneck to the overall system performance.

In this talk, I will first give a recap of our last paper, where we proposed a methodology for making good design decisions for FPGA-SoC systems that use shared DDR memory for communication. Afterwards, I will present our ideas for extending the methodology. These include support for energy measurement, for heterogeneous memory architectures, and for multiple HW accelerators running in parallel.

## 13.09.2017-10:30: "Introduction to Silexica and SLX tools". Won-Tae Joo

Silexica is a compiler start-up that was founded in 2014 as a spin-off from an institute of RWTH Aachen University. Our product SLX mainly focuses on assisting both manual and automatic parallelization of C/C++ code targeting embedded platforms, using both static and dynamic approaches. The presentation will be an introduction to Silexica and our product SLX.

## 07.09.2017 - 10:00: "Adaptive Face Recognition Using Convolutional Neural Network". Ahmad Chamas.

Deep learning, a branch of machine learning, has recently become one of the most important topics in computer vision. Convolutional Neural Networks (CNNs) are an efficient tool for deep learning that can automatically learn features from the input image to achieve a high recognition rate in image classification. The purpose of this thesis is to investigate machine learning techniques, and specifically the use of a CNN for recognizing human faces. The work aims to create a CNN model and to investigate methods of speeding up its training and improving its accuracy by determining appropriate parameters. However, the main challenge is to make the network model adaptive. This means the network model should be able to learn more classes (faces) in a more organic manner, similar to the human brain, minimizing the training time by partially retraining the model to add the new class rather than requiring full retraining. Transfer learning is investigated as a technique to handle changes to the dataset (addition of a new class). This technique is realized by taking the pre-trained CNN weights, replacing the last layer (the fully connected layer), and training the model with the pre-trained convolutional layers. The Extended Yale B dataset is used in addition to the researcher's own dataset for training the CNN model. The dataset contains varying illumination conditions and poses. The proposed CNN model performed well on both CPU and GPU architectures. The CPU architecture test achieved 94.2% accuracy with 107 min of estimated training time; the GPU test achieved 93.95% accuracy with 41 min of estimated training time. After the addition of a new class, performance improved significantly, bringing test accuracy up to 95.17% and 95.5% on CPU and GPU architectures, respectively. The estimated training time was halved after re-training.

## 24.08.2017-10:00: "Limits of Instruction Level Parallelism 2017". Susheel Puranik.

Due to the breakdown of Dennard scaling in 2006, Design Space Exploration (DSE) of modern processors often entails a trade-off between performance and power limits. These real-world physical constraints have limited our understanding of the fallacies concerning the limits of workload distribution across multiprocessor systems. Even very basic forms of parallelism such as Instruction-Level Parallelism (ILP) are often not revisited as newer technologies arrive in modern computer performance analysis. This presentation analyzes the limits of ILP using trace-driven simulation and the SPEC 2006 benchmarks. Such work was previously performed in 1993 by David W. Wall. Since then, many of the issues described in the 1993 Limits of ILP paper have been overcome. Our results compare the outdated findings with newer results obtained using the modern simulator gem5. The presentation will also include a short description of the gem5 simulator and of the process by which it was selected.
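
The idea of a trace-driven ILP limit can be sketched as follows: each instruction in the trace is scheduled in the earliest cycle in which all of its source values are available, and the ILP is the instruction count divided by the length of the resulting schedule. This minimal sketch assumes perfect branch prediction, perfect register renaming, unlimited resources, and unit latency; the trace format and register names are hypothetical, and it is neither Wall's simulator nor the gem5 setup used in the talk.

```python
def dataflow_ilp(trace):
    """Upper bound on ILP when only true (read-after-write) dependences constrain issue.

    trace: list of (dest, sources) tuples in program order; registers are strings.
    """
    ready = {}          # register -> cycle in which its latest value is produced
    last_cycle = 0
    for dest, sources in trace:
        # issue one cycle after the latest producer among the sources
        cycle = 1 + max((ready.get(src, 0) for src in sources), default=0)
        if dest is not None:
            ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(trace) / last_cycle if last_cycle else 0.0


# Hypothetical six-instruction trace
example = [
    ("r1", []),             # r1 = constant
    ("r2", []),             # r2 = constant
    ("r3", ["r1", "r2"]),   # r3 = r1 + r2
    ("r4", ["r1"]),         # r4 = r1 * 2
    ("r5", ["r3", "r4"]),   # r5 = r3 - r4
    ("r6", []),             # independent instruction
]
ilp = dataflow_ilp(example)
```

The six instructions fit into three cycles, giving an ILP of 2.0. Limit studies then tighten the idealized assumptions one at a time (realistic branch prediction, finite window size, and so on) and observe how far the achievable ILP drops.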

## 17.08.2017-10:30: "Quantitative Analysis of Advanced Computer Architecture Techniques-Dynamic Exploitation of ILP". Wenbo Li.

The textbook "Computer Architecture: A Quantitative Approach" by John Hennessy and David Patterson is well written, but no experimental results are provided that demonstrate the performance benefits of out-of-order execution, speculation, etc. The performance improvements due to these techniques are mainly demonstrated using paper-and-pencil examples.

I therefore design experiments that provide a quantitative analysis of those techniques. I compare different modern processor simulators, select the most suitable one, gem5, and extend it. My work mainly focuses on branch prediction, dynamic scheduling, speculation, and superscalar execution. I would like to show my results regarding these topics.

## 17.08.2017-10:00: "Minimizing Interference in Multi-ToF Camera Environments". Siby Thachil.

Time-of-Flight (ToF) cameras provide real-time depth information along with a color image at a frame rate of 30 fps. ToF cameras illuminate the scene with an active light source, and a built-in photodetector captures the reflected light. Depth data is extracted by calculating the total time of flight of the emitted light pulses. When two or more ToF cameras are present in a scene, mutual interference may be significant. Due to limited processing power and memory, manufacturers deploy simple solutions such as a hardware counter to minimize the overlap between illumination periods. To reduce the mutual interference, three schemes are explored: smart round-robin, multi-frequency operation, and randomization of illumination. When the cameras are connected in a ring, the smart round-robin scheme enables automatic synchronization and hot swapping of up to 8 cameras. Multi-frequency operation means running different cameras at different base frequencies. The third approach randomizes illumination by applying pseudo-random codes. I will present the schemes and the initial observations.
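
The depth computation mentioned above follows from the fact that the light travels to the object and back, so the distance is half the round-trip time multiplied by the speed of light. The following minimal sketch shows that relation only; the function name and the 20 ns example value are assumptions, and real cameras typically derive the time from phase or gated intensity measurements rather than timing a pulse directly.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def depth_from_round_trip(t_seconds):
    """Distance to the reflecting surface for a measured round-trip time."""
    return SPEED_OF_LIGHT * t_seconds / 2.0


# Hypothetical measurement: a 20 ns round trip corresponds to roughly 3 m
depth = depth_from_round_trip(20e-9)
```

Light from a second camera arriving in the same exposure window corrupts this time estimate, which is why the schemes above try to keep illumination periods from overlapping.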

## 03.08.2017- 10:00: "Improved Wavefront Parallel Processing for HEVC Decoding". Philipp Habermann.

Wavefront Parallel Processing (WPP) is a high-level parallelization tool adopted in the High Efficiency Video Coding standard (HEVC/H.265). WPP allows the simultaneous processing of consecutive rows of coding tree units (CTUs) in a frame. A horizontal offset of at least two CTUs between consecutive rows is commonly implemented to maintain all dependencies. This causes the number of active parallel decoding threads to ramp up and down at the beginning and end of each frame, which limits parallel scalability. However, a two-CTU offset is not necessary for the whole decoding process. In this paper, we propose a fine-grained synchronization approach to address the ramping problem. A theoretical analysis of the decoding dependencies is performed to estimate the potential speed-up with this approach.
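
The ramping effect can be estimated with a simple model: with one thread per CTU row and a fixed offset between rows, the last row starts `offset * (rows - 1)` steps late and then needs one step per CTU. This sketch assumes that every CTU takes the same time and that enough threads are available; the function names and the frame dimensions are illustrative, and the smaller offset is only a what-if comparison, not the fine-grained scheme proposed in the paper.

```python
def wpp_time_steps(rows, cols, offset=2):
    """CTU time steps needed to decode a frame with one thread per CTU row.

    Row r may start once row r-1 is `offset` CTUs ahead; each CTU takes one step.
    """
    if rows <= 0 or cols <= 0:
        return 0
    return cols + offset * (rows - 1)


def wpp_speedup(rows, cols, offset=2):
    """Speed-up over sequential decoding, which needs rows * cols steps."""
    return (rows * cols) / wpp_time_steps(rows, cols, offset)


# Hypothetical 1080p frame with 64x64 CTUs: 30 CTUs per row, 17 rows
steps_two_ctu = wpp_time_steps(17, 30, offset=2)
steps_one_ctu = wpp_time_steps(17, 30, offset=1)
speedup_two_ctu = wpp_speedup(17, 30, offset=2)
```

In this model the two-CTU offset needs 62 steps instead of the 30 an ideal 17-way parallel decoder would need, and a one-CTU offset would need 46, which shows how much the offset contributes to the ramping overhead.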

## 27.07.2017- 10:30: "M. K. Qureshi: “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs”". Tareq Alawneh.

In this talk, I will present and discuss a paper by Moinuddin K. Qureshi published at the 15th International Symposium on High Performance Computer Architecture. In a private CMP cache organization, the Last Level Cache (LLC) is statically partitioned among all the cores. Such a cache organization does not share capacity in response to the requirements of individual cores, which in turn may incur more cache misses. In this paper, the author proposes Dynamic Spill-Receive (DSR), in which each cache can either spill evicted cache lines to other caches or receive them, but not both. The DSR architecture differs from previous related work, Cooperative Caching (CC), by taking into account the cache requirements of different cores.

## 27.07.2017- 10:00: "Huangfu et al.: “Static WCET Analysis of GPUs with Predictable Warp Scheduling”". Jan Lucas.

In this talk, I will present a paper by Yijie Huangfu and Wei Zhang from the 20th International Symposium on Real-Time Distributed Computing. GPUs offer very high throughput and very good average-case performance. However, when we want to use them for real-time applications, we need to guarantee a good enough worst-case execution time (WCET) instead. Huangfu and Zhang present a WCET analyser for GPUs, together with a small hardware modification that provides a predictable GPU warp scheduling policy.

## Thursday July 20, 2017, 10:30, EN 643/644: "Perceptron Learning for Reuse Prediction". Elvira Teran.

The disparity between last-level cache and memory latencies motivates the search for efficient cache management policies. Recent work in predicting reuse of cache blocks enables optimizations that significantly improve cache performance and efficiency. However, the accuracy of the prediction mechanisms limits the scope of optimization. This paper proposes perceptron learning for reuse prediction. The proposed predictor greatly improves accuracy over previous work. For multi-programmed workloads, the average false positive rate of the proposed predictor is 3.2%, while sampling dead block prediction (SDBP) and signature-based hit prediction (SHiP) yield false positive rates above 7%. The improvement in accuracy translates directly into performance. For single-thread workloads and a 4MB last-level cache, reuse prediction with perceptron learning enables a replacement and bypass optimization to achieve a geometric mean speedup of 6.1%, compared with 3.8% for SHiP and 3.5% for SDBP on the SPEC CPU 2006 benchmarks. On a memory-intensive subset of SPEC, perceptron learning yields 18.3% speedup, versus 10.5% for SHiP and 7.7% for SDBP. For multi-programmed workloads and a 16MB cache, the proposed technique doubles the efficiency of the cache over LRU and yields a geometric mean normalized weighted speedup of 7.4%, compared with 4.4% for SHiP and 4.2% for SDBP.

Elvira is a Ph.D. candidate at Texas A&M University. She received a B.S. in Computer Science from The University of Texas at San Antonio. During her studies she has worked under the supervision of Prof. Daniel A. Jiménez. Her research focuses on improving performance in the memory hierarchy. She will be graduating this coming August.

• Title: Perceptron Learning for Reuse Prediction
• Presenter: Elvira Teran - Texas A&M University
• Date and time: Thursday July 20, 2017, 10:30 - 11:30
• Room: E-N 643/644 (AES seminar room)

## 20.07.2017- 10:00: "Enable online training for deep-learning: Transfer Learning and Approximate Computing Approach". Ahmed Elhossini.

Deep learning (DL) is an efficient tool for machine learning and specifically for computer vision. The main advantage of deep learning is that it combines the two main phases of computer vision algorithms, feature extraction and classification. Both stages are learned directly from the training data. This reduces the effort needed to engineer the feature extractor and the classifier. Successful DL models, in the form of Convolutional Neural Networks (CNNs), have complex topologies with tens or hundreds of layers and are trained with large datasets containing millions of images. The training process is very time-consuming and requires high computational power. For that reason, the recent success of DL was only possible because of recent advances in computer architecture and in multi-core/many-core architectures such as GPUs. However, this makes training infeasible on embedded hardware with low computational power. As a result, recent work on accelerating DL on embedded hardware for real-time applications focuses mostly on the inference/prediction stage (computing the output of a trained network), leaving the training stage to be performed off-line (pre-trained networks) or remotely on more powerful hardware. In this presentation, we will present some ideas for accelerating the training stage on embedded hardware. This will allow on-line training (the system can learn new information from new data during its execution) without the need to send training data for remote processing. This approach will also reduce the computational effort and power budget of the training stage in general. The proposed approach depends mainly on the concepts of transfer learning and approximate computing.

## 13.07.2017-10:00: "Local Memory Aware Kernel Perforation". Daniel Maier.

Trading accuracy for performance, known as approximate computing, has been investigated using hardware- and software-based solutions. GPUs are one of the most widely used platforms; they provide high throughput because of their massively parallel architecture and their memory model. Programming GPUs and utilizing a reasonable part of their computational power can be quite challenging. Moreover, for software-based approximation techniques, the new opportunities offered by the GPU architecture have not been thoroughly explored. In this work, we propose local memory aware kernel perforation. The key idea is to introduce an approximate prefetching phase, in which part of the data that would otherwise be fetched from global memory is instead approximated by computation in local memory. In this presentation, I am going to briefly introduce the state of the art and present my approach, the ongoing work, and the next steps.

## 06.07.2017: "A Comprehensive Performance Analysis of Vulkan Compute vs OpenCL and CUDA". Nadjib Mammeri.

The recent introduction of the Vulkan API along with SPIR-V, an intermediate language for native representation of graphical shaders and compute kernels, provides GPU programmers with a new programming model for writing and tuning GPGPU applications. The Vulkan specification defines four types of queues: graphics, compute, transfer and sparse. Vulkan is flexible in the sense that a driver need not expose every queue type; a Vulkan driver could, for instance, expose only compute. Hence, Vulkan presents another compute route that could be explored along with OpenCL and CUDA. However, there is a lack of scientific data comparing this new programming model with established programming models such as CUDA and OpenCL.

In this talk, I’ll present my work on a paper comparing Vulkan compute with CUDA and OpenCL. I will present the plan and methodology for my work, provide updates on my progress, and share the initial results I have obtained so far.

## 29.06.2017-10:00: "SLC: Selective Lossy Compression to Leverage Memory Access Granularity". Sohan Lal.

Modern GPUs provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Memory compression is a promising approach for improving memory bandwidth, which can translate into higher performance and energy efficiency. However, the effective compression ratio is much lower due to the restrictions of memory access granularity (MAG). The reason is that, with a lossless compression technique, many blocks are compressed to a size that is a few bytes more than a multiple of the MAG, yet the whole additional burst has to be fetched to satisfy the MAG restrictions. These extra bytes significantly reduce the compression ratio and hence the performance gain. In this work, we propose selective lossy compression (SLC), which leverages the MAG.

The key idea is to selectively approximate blocks if the extra bytes are within a user-defined threshold. In this talk, I will present two ways of selectively approximating blocks and some initial results.
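
The effect of the MAG on the compression ratio can be illustrated with a small calculation. This is a minimal sketch under assumed parameters (128-byte blocks, a 32-byte MAG, and a 4-byte threshold); the function names are hypothetical, and the sketch only models how many bytes are fetched, not how a block would actually be approximated.

```python
import math


def fetched_bytes(compressed_size, mag=32):
    """Bytes actually transferred: the compressed size rounded up to a MAG multiple."""
    return max(mag, math.ceil(compressed_size / mag) * mag)


def slc_fetched_bytes(compressed_size, mag=32, threshold=4):
    """Selective lossy variant: drop the extra bytes when they are within the threshold.

    A block that exceeds a MAG multiple by at most `threshold` bytes is
    approximated down to that multiple; all other blocks stay lossless.
    """
    extra = compressed_size % mag
    if 0 < extra <= threshold and compressed_size > mag:
        return compressed_size - extra
    return fetched_bytes(compressed_size, mag)


BLOCK_SIZE = 128                                            # assumed block size in bytes
lossless_ratio = BLOCK_SIZE / fetched_bytes(36)             # 36 bytes still cost two bursts
selective_ratio = BLOCK_SIZE / slc_fetched_bytes(36)        # the 4 extra bytes are dropped
```

In this example a block compressed to 36 bytes has a raw compression ratio of about 3.6, but the lossless scheme fetches 64 bytes (effective ratio 2.0), whereas approximating away the 4 extra bytes brings the fetch down to 32 bytes (effective ratio 4.0).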

## 22.06.2017: "Overview of Ongoing Research Activities". Biagio Cosenza.

In this presentation, I will give an overview of my ongoing research work on automatic tuning with machine learning, compiler optimizations, domain-specific languages and programming models for parallel architectures. The presentation will conclude with a few notes about the Artifact Evaluation process.

## 08.06.2017-10:00: "Design and implementation of a convolutional filter for deeplearning". Lars Schymik.

Convolutional Neural Networks (CNNs), a class of deep neural networks, are becoming more and more important for live pattern recognition in abstract data, especially for speech recognition (e.g. in mobile phones) and image recognition (e.g. of objects and faces in automotive systems such as Tesla's Autopilot, or in smart home environments). In image processing, deep nets aim to find a set of locally connected neurons that form so-called features of the object to be recognized, whereas classic neural nets just aim to learn a full weight matrix between two connected layers. Feature recognition and learning in CNNs is done using convolution filters whose kernels are not fixed but data-specific.

Convolutional Filters are computationally expensive, so the next technological step after using GPU power through OpenCL or CUDA is to parallelize filter operations as much as possible to speed up learning but also classification time. In this presentation I explain my VHDL design of a parallel convolutional filter stage for a CNN in a Xilinx ZYNQ SoC and show performance results compared to current solutions as well as give an outlook on possible optimizations and limits.

## 01.06.2017- 10:00: "Correct and reliable construction of complex embedded real-time systems". Dr. Paula Herber.

About 98% of computing devices are embedded in all kinds of electronic equipment and machines. At the same time, failure of embedded systems often has serious consequences, such as huge financial losses or even loss of lives. Thus, the correctness and reliability of embedded systems are of vital importance. To achieve this, a clear understanding of the models and languages that are used in the development process is needed. Formal methods provide a basis to make the development process systematic, well-defined, and automated. However, for many industrially relevant languages and models, such as Matlab/Simulink and SystemC, a formal semantics does not exist. Together with the restricted scalability of formal design and verification techniques, this makes the correct and reliable construction of embedded systems a major challenge. In this talk, I will summarize some of my contributions and current research activities in this field.

## 18.05.2017-10:00: "E²MC: Entropy Encoding Based Memory Compression for GPUs". Sohan Lal.

Modern Graphics Processing Units (GPUs) provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Memory compression is a promising approach for improving memory bandwidth, which can translate into higher performance and energy efficiency. However, compression is not free and its challenges need to be addressed; otherwise the benefits of compression may be offset by its overhead. In this presentation, we present an entropy encoding based memory compression (E²MC) technique for GPUs, which is based on the well-known Huffman encoding. We study the feasibility of entropy encoding for GPUs and show that it achieves higher compression ratios than state-of-the-art GPU compression techniques. The average compression ratio of E²MC is 53% higher than the state of the art. This translates into an average speedup of 20% compared to no compression, which is 8% higher than the state of the art. Furthermore, we show that energy consumption and energy-delay product are reduced by 13% and 27%, respectively.
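
For readers unfamiliar with Huffman encoding, the following minimal software sketch builds a prefix code from symbol frequencies and computes the resulting compression ratio. It assumes byte-sized symbols and a per-block code table; the function names and the example data are hypothetical, and the symbol length, code table construction, and hardware design used in E²MC are not shown here.

```python
import heapq
from collections import Counter


def huffman_code(data):
    """Build a Huffman code (symbol -> bit string) for the symbols in `data`."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # heap entries: (weight, unique tie-breaker, {symbol: code built so far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def compression_ratio(data, bits_per_symbol=8):
    """Uncompressed size divided by Huffman-coded size (code table not counted)."""
    code = huffman_code(data)
    compressed_bits = sum(len(code[s]) for s in data)
    return len(data) * bits_per_symbol / compressed_bits if compressed_bits else 1.0


# Hypothetical 16-byte block dominated by zero bytes
block = bytes([0] * 12 + [1] * 2 + [2, 3])
code = huffman_code(block)
ratio = compression_ratio(block)
```

Frequent symbols receive short codes: here the zero byte is encoded in a single bit, so the 128-bit block shrinks to 22 bits before any storage for the code table is counted.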

## 11.05.2017-10:00: "Vectorization in LLVM". Angela Pohl.

LLVM currently utilizes two different techniques for automatic vectorization. I have conducted experiments to analyse their current capabilities, and will present results from my quantitative and qualitative analysis. Furthermore, I will talk about my next steps for enabling more code patterns to be vectorized, and give a brief overview of the student theses I am currently supervising.

## 27.04.2017-10:30: "Development of a RICH particle identification algorithm on Intel’s Knights Landing Platform". Christina Quast.

At CERN, particles are collided in order to understand how the universe was created. In the LHCb experiment, proton-proton collisions are used to investigate the matter-antimatter asymmetry of the universe. A non-optimized implementation of the pattern recognition algorithm for the current CPU hardware exists, written in C++. For my master's thesis, the algorithm will be ported to and optimized for Intel's Xeon Phi Knights Landing (KNL) processor, a CPU designed for HPC (High-Performance Computing).

The memory accesses and data representation in memory for the program have to be researched in detail to determine the best optimization techniques which should be applied for the RICH pattern detection algorithm used in the LHCb experiment. The hotspots of the code need to be identified and optimized first, because they will have the biggest influence on the performance. Parallelization and vectorization will be applied in order to make use of all the resources provided by the Knights Landing processor.

## 13.04.2017-10:00: "High-performance video decoding using GPUs". Biao Wang.

The increasing demand for high-resolution, high-quality video can make decoding computationally challenging on conventional CPU architectures. Graphics Processing Units (GPUs), as an alternative general-purpose computing architecture, generally provide higher computational power than CPUs. This research explores whether GPUs can be used effectively for video decoding, even though the application is not entirely suited to the GPU architecture.

In this study, the video decoding stages of inverse transform, motion compensation, and in-loop filters are parallelized and optimized for GPUs. Furthermore, a highly parallel CPU+GPU decoder for the latest HEVC standard is proposed, which simultaneously exploits decoding parallelism on the CPU, on the GPU, and between the CPU and GPU devices. Compared to a highly optimized CPU implementation, the proposed individual kernels achieve speedups of up to 26x. For the complete HEVC CPU+GPU decoder, a speedup of 2.2x is achieved at the application level.

Experimental results show that GPUs can be effectively used to assist video decoding, provided that the video decoding workload is carefully assigned to the CPU and GPU devices, and the available decoding parallelism is appropriately exploited on them, as performed in this research.

## 23.03.2017-10:30: "A Quantitative Analysis of the Memory Architecture of FPGA-SoCs". Matthias Göbel.

In recent years, so-called FPGA-SoCs have been introduced by Intel (formerly Altera) and Xilinx. These devices combine multi-core processors with programmable logic. This presentation analyzes the various memory and communication interconnects found in actual devices, particularly the Zynq-7020 and Zynq-7045 from Xilinx and the Cyclone V SE SoC from Intel. Issues such as different access patterns, cache coherence and full-duplex communication are analyzed, both for generic accesses and for a real workload from the field of video coding. Furthermore, the presentation shows that by carefully choosing the memory interconnect networks as well as the software interface, high-speed memory access can be achieved for various scenarios.

## 23.03.2017-10:00: "Microbenchmarking the GTX580 Memory Interface". Jan Lucas.

NVidia's GTX 580 GPU is reported to have six 64-bit wide memory channels, but is that really true? In this talk a novel microbenchmarking technique is used to reveal new information about the internal topology of the memory interface. We reveal the real number of memory channels and show how the mapping of addresses to memory channels can be uncovered. Data-dependent power consumption in the memory interface was also investigated and early results will be presented.

## 16.03.2017-10:30: "Embedded Load Balancing". Robert Hering.

The performance of modern embedded systems is becoming increasingly multivariate and complex. Multiple performance values have to be tuned through multiple parameters, while complex dependencies emerge due to limited system resources and changing environmental conditions. As manual optimization based on experience with the platform thus becomes impractical, a demand for automatic performance tuning arises. In this master's thesis, a generic optimization approach is developed using machine learning, with a focus on higher-performance embedded systems. The optimization algorithm is presented, followed by a discussion of evaluation results and an optional live demonstration.

## 16.03.2017-10:00: "A Memory Architecture for Data Access Patterns in Multimedia Applications". Tareq Alawneh.

As the speed gap between CPU and external memory widens, the memory latency has become the dominant performance bottleneck in modern applications. Caches play an important role in reducing the average memory latency. Their performance is strongly influenced by the way data is accessed. Numerous multimedia algorithms, which operate on data such as images and videos, perform processing over rectangular regions of pixels. If this distinctive data access pattern as well as other data access patterns are exploited properly, significant performance improvements could be achieved. In this paper, a new memory organization exploiting 2D, stride and sequential data access patterns, exhibited in multimedia applications, is proposed. It aims at reducing the memory access latency, lowering the number of memory accesses and utilizing the bandwidth efficiently.
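To illustrate why rectangular pixel accesses interact poorly with a conventional linear layout, the following minimal sketch counts how many cache lines an 8x8 pixel block touches. The frame width, 64-byte line size, one byte per pixel, and the tiled layout are assumptions chosen for illustration; they are not the memory organization proposed in the talk.

```python
def lines_row_major(x, y, w, h, frame_width, line_bytes=64):
    """Cache lines touched by a w*h block when pixels are stored row by row."""
    lines = set()
    for row in range(y, y + h):
        for col in range(x, x + w):
            addr = row * frame_width + col  # 1 byte per pixel (assumption)
            lines.add(addr // line_bytes)
    return len(lines)


def lines_tiled(x, y, w, h, frame_width, tile=8, line_bytes=64):
    """Same count when each 8x8 tile is stored contiguously (hypothetical layout)."""
    tiles_per_row = frame_width // tile
    lines = set()
    for row in range(y, y + h):
        for col in range(x, x + w):
            tile_index = (row // tile) * tiles_per_row + (col // tile)
            offset = (row % tile) * tile + (col % tile)
            addr = tile_index * tile * tile + offset
            lines.add(addr // line_bytes)
    return len(lines)


FRAME_WIDTH = 1920  # assumed Full HD luma plane
print(lines_row_major(0, 0, 8, 8, FRAME_WIDTH))  # one line per pixel row
print(lines_tiled(0, 0, 8, 8, FRAME_WIDTH))      # whole block in one line
```

With these assumptions, an aligned block needs eight line fetches in the row-major layout but only one in the tiled layout, which is the kind of saving a 2D-aware organization aims for.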

## 09.03.2017-10:00: "Adaptive face recognition using Convolutional Neural Network". Ahmed Chamas.

Recently, deep learning, a branch of machine learning, has become one of the most important topics in computer vision. The Convolutional Neural Network (CNN) is an efficient deep learning tool that can automatically learn features from the input image to achieve a high recognition rate in classification problems. The aim of this thesis is to investigate machine learning techniques, and specifically CNNs, for the problem of recognizing human faces. The main objectives are to find ways to speed up training and to improve accuracy. However, the main challenge is to make the network model adaptive: the model should be able to learn more classes (identities of persons), as the human brain does, without affecting the recognition accuracy for the older classes.

Transfer learning is a technique that can be used each time a change in the dataset occurs (in our case, a new person). It is realized by taking the pre-trained CNN weights, replacing the last layer, which is the fully connected layer, and training it together with the pre-trained convolutional layers. The Yale Extended dataset, combined with our own dataset, is used to train the CNN model. The dataset has varying illumination conditions and poses. The initial results show that the CNN model achieves good performance, with over 97% test accuracy before adding new classes. In this presentation, we will explain the architecture of the CNN model, discuss the transfer learning process, and finally present a demo of this CNN model.

## 23.02.2017-10:00: "Syntax Element Partitioning for high-throughput HEVC CABAC Decoding". Philipp Habermann.

Encoder and decoder implementations of the High Efficiency Video Coding (HEVC) standard have been subject to many optimization approaches since its release in 2013. However, the real-time decoding of high-quality and ultra-high-resolution videos is still a very challenging task. In particular, entropy decoding (CABAC) is most often the throughput bottleneck for very high bitrates. Syntax Element Partitioning (SEP) has been proposed for the H.264/AVC video compression standard to address this issue and the limitations of other parallelization techniques. Unfortunately, it has not been adopted in the latest video coding standard, although it can multiply the throughput of CABAC decoding. We propose an improved SEP scheme for HEVC CABAC decoding with eight syntax element partitions. Experimental results show throughput improvements of up to 5.4x with negligible bitstream overhead, making SEP a useful technique to address the entropy decoding bottleneck in future video compression standards.

## 02.02.2017-10:00: "HEVC acceleration on MPPA". Ahmed Zaky.

The HEVC decoder developed by the TU Berlin group is optimized for multiple architectures such as x86 and ARM, using techniques like SIMD to improve decoding speed. The aim of this master's thesis is to present an acceleration method for the HEVC decoder on a new architecture, the Kalray MPPA manycore processor. The MPPA setup consists of a host processor, an IO processor, and 256 general-purpose cores divided into 16 clusters. The main challenge is to split the existing HEVC decoder code (implemented as one multi-threaded process) into three processes to match the Kalray MPPA architecture (host processor, IO processor, and cluster general-purpose processors). To realize this implementation, a simulated version of the functional split and communication model is implemented on a Linux-based x86 PC. Furthermore, a port of the HEVC decoder to the MPPA platform is implemented that uses one thread of one cluster, and the corresponding decoding speed is evaluated. A detailed analysis of the memory constraints and communication overhead of the proposed solution is presented, showing the bottlenecks in IO and cluster communication. Finally, the maximum theoretical decoding rate, recommendations, and future work are presented. These recommendations include increasing the utilization of the MPPA cluster resources and improving communication between the IO and cluster modules.

## 19.01.2017-10:00: "Automated Characterization of GPU Features Using OpenCL Microbenchmarks". Felix Goroncy.

Graphics Processing Units (GPUs) have gained high importance during the past two decades. In recent years GPUs have shown huge improvements in their performance. Development has been pushed towards more general purpose programmability, introducing the term General Purpose Graphics Processing Unit (GPGPU). With the emergence of high-level languages such as CUDA and OpenCL, GPUs have found use in a myriad of applications outside the graphical domain.

Using high-level languages simplifies software development and allows device-independent development. With OpenCL, development is also vendor independent. Writing highly optimized code, however, is very hardware and device dependent and cannot fully benefit from the abstraction layers. Detailed knowledge about the inner workings of the GPU is vital, but finer architectural aspects are often not documented.

We provide a framework of OpenCL microbenchmarks to uncover device-specific architectural details of GPUs from different hardware vendors. We show results for the inferred cache architecture and estimates of access delays for the cache and memory of the PowerVR Rogue G6430 embedded GPU, the integrated GPU of the AMD E-350D APU, and the GeForce TITAN X. In addition, SIMD width, branch divergence behaviour and number of compute units are revealed for the mentioned devices. More benchmarks are still in development.

## 12.01.2017-11:00: "Investigation of real-time Ethernet based fieldbus protocols for power electronic devices". Sanjay Santhosh.

Precise real-time control and synchronization are important in systems with several coupled power electronic devices. Precise control of a distributed system requires a highly efficient fieldbus system. Real-time Ethernet based fieldbus protocols have gained huge significance in industrial automation. A new real-time Ethernet protocol is proposed for the point-to-point communication of power electronic drives. The new protocol is intended to offer higher performance and reduced complexity compared to the existing hardware-assisted RTE protocols. It is to support a line topology of controllers, allowing them to communicate with each other and with the master (with minimum latency) and to be synchronized with each other.

The thesis studies the existing real-time Ethernet protocols and analyzes their real-time capabilities and proposes a new real-time Ethernet protocol based on the previous studies. The newly proposed protocol is implemented and a performance analysis is carried out and is compared with that of the existing protocols.

## 12.01.2017-10:00: "Optimising HEVC Decoding using Intel AVX512 SIMD Extensions". Xu CAO.

SIMD instructions have been commonly used to accelerate video codecs. The recently introduced HEVC codec, like its predecessors, is based on the hybrid video codec principle and is therefore also well suited to acceleration with SIMD. The new Intel AVX512 SIMD extensions enable processing of more data elements with a single instruction than previous SIMD extensions such as SSE, AVX and AVX2. In this thesis, the differences between AVX512 and other SIMD extensions like AVX2 and SSE, and the advantages of using AVX512 to optimise HEVC decoding, are presented. The AVX512 SIMD extensions are applied to the HEVC decoding kernels that can potentially benefit (inter prediction, inverse transform and deblocking filter). The performance evaluation, based on 1080p (HD) and 2160p (UHD) video sets under the Intel Software Development Emulator (SDE), indicates that with AVX512 the number of executed instructions can be reduced by up to 23% for 1080p 8-bit video compared to the AVX2-optimised HEVC decoder, and by up to 31% for 2160p 10-bit video. Compared to the scalar implementation, the AVX512 optimisation reduces the number of executed instructions by 85% and 83% for 1080p and 2160p, respectively.
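As a small worked example of "more data elements per instruction", the sketch below computes how many samples fit into one register for the extensions named above. The register widths are the architectural ones; the assumption that 10-bit samples are held in 16-bit lanes is a common convention and is not stated in the abstract.

```python
# Register widths in bits for the SIMD extensions named in the abstract.
REGISTER_BITS = {"SSE": 128, "AVX2": 256, "AVX512": 512}


def lanes(extension, element_bits):
    """Number of elements of the given width that fit into one register."""
    return REGISTER_BITS[extension] // element_bits


# 8-bit samples, and 16-bit lanes (assumed storage for 10-bit video samples).
for element_bits in (8, 16):
    print(element_bits, {ext: lanes(ext, element_bits) for ext in REGISTER_BITS})
```

The lane count only bounds the possible reduction in instruction count; the measured reductions quoted above also depend on how much of each kernel is vectorizable.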

## 15.12.2016-10:00: "Boosting Vectorization in LLVM". Angela Pohl.

Vectorization is a key technique to increase application performance. In modern architectures, special hardware registers have been added for this purpose, executing the same calculation on multiple data chunks in parallel, typically exploiting a program's data level parallelism. Ideally, code should be vectorized automatically by today's compilers, supplying an ideal mapping of the application to these special purpose registers and instructions. The reality is, though, that compilers lag behind compared to manual optimizations with specialized instructions, so-called intrinsics. In my research, I investigated the state-of-the-art vectorization capabilities of popular compilers and identified opportunities to improve the overall vectorization rate. In this presentation, I will show these analysis results for the LLVM compiler and discuss how I was able to improve the compiler's two vectorization passes, yielding a significantly higher vectorization rate and hence improving the average speedup of the used test patterns.

## 01.12.2016-10:00am: "GEMU: A Simulation Framework for GPUs powered by Vulkan® and SPIR-V". Nadjib Mammeri.

The recent introduction of the Vulkan API along with SPIR-V, an intermediate language for native representation of graphical shaders and compute kernels, promises cross-platform access, high efficiency and better performance of graphics applications. However, none of the currently available GPU simulators supports these new Khronos standards. The GEMU project here at the AES group addresses this issue by developing a new fully fledged GPU simulation framework. GEMU provides a research platform for exploring GPU architectures with the latest APIs. In this talk, I’ll briefly introduce Vulkan and SPIR-V, describe GEMU in greater detail and discuss future work and research ideas.

## 24.11.2016-10:30am: "Perceptually Enhanced HEVC/H.265 Encoding: Focus on Smooth Regions". Martin Hennig.

As the name suggests, a main goal of High Efficiency Video Coding (HEVC) is efficiency. Therefore, HEVC uses Rate Distortion Optimisation (RDO) to achieve optimal performance. The challenge is to find the best result in terms of quality per bit rate. One problem is the definition of quality in this context. Within RDO, quality is currently defined purely objectively, without taking into account any perceptual features of the Human Visual System (HVS). However, experiments have shown that the sensitivity of the HVS varies with the input. Thus, there are many approaches that allocate the bit rate according to the perceptual sensitivity of the input. To improve the perceptual performance, this thesis follows the concept of adapting the Lagrangian multiplier for each Coding Tree Unit (CTU) to its perceptual sensitivity. The algorithm is implemented in the TUB-HEVC encoder and tested on several sequences.
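The role of the Lagrangian multiplier can be sketched as follows: RDO picks the coding option that minimizes the cost J = D + λ·R, so scaling λ per CTU shifts the balance between distortion D and rate R. The candidate numbers and the scale factor below are made up for illustration, and the scaling rule is an assumption, not the method of the thesis.

```python
def best_mode(candidates, lam):
    """Return the name of the candidate minimizing J = D + lam * R.

    candidates: list of (name, distortion, rate_in_bits) tuples.
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])[0]


# Hypothetical candidates for one CTU: cheap but coarse vs. costly but accurate.
candidates = [("skip", 400, 2), ("intra", 100, 60)]

base_lambda = 10.0
print(best_mode(candidates, base_lambda))  # rate dominates the cost

# Assumed perceptual scaling: a smaller lambda for a CTU where the viewer is
# sensitive to artifacts, so that distortion weighs more heavily.
sensitive_scale = 0.2
print(best_mode(candidates, base_lambda * sensitive_scale))
```

With the base multiplier the low-rate option wins; with the reduced multiplier the encoder spends more bits on the same CTU.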

## 24.11.2016-10:00am: "Memory Subsystem Performance Evaluation of the Kalray MPPA". Beatrice Tirziu.

The actual implementation of a memory-demanding application like the HEVC decoder on many-core systems should be preceded by a performance evaluation of the memory subsystems. The goal is to determine the peak throughput and to establish whether the memory accesses limit the overall performance. The evaluated many-core processor is the Kalray MPPA-256 Andey processor. The processor integrates 256 user cores and 32 system cores on a chip, each running at a frequency of up to 400 MHz. The cores are organized in four peripheral I/O subsystems and 16 compute clusters, each compute cluster having 2 MB of shared memory. Micro-benchmarks have been implemented to test the performance of the different communication paths: host memory to MPPA's off-chip memory, MPPA's off-chip memory to the compute clusters' shared memories, and between the compute clusters' shared memories. Multiple compute clusters are considered, each with up to 16 active cores. In addition, a trace-based benchmarking of the motion compensation and write-back stages of the HEVC decoder is performed. Since these two stages require high memory bandwidth, it is important to predict whether an implementation would satisfy real-time requirements.

The results of the micro-benchmarks show that no more than 2 GB/s can be achieved for the communication with the host CPU. For the communication between the MPPA's off-chip memory and the compute clusters' shared memories, the maximum bandwidth obtained is 3 GB/s for reading from the off-chip memory, while writing is limited to 1.5 GB/s. The best results are obtained for the largest transfer sizes, with performance decreasing as the transfer size shrinks. In the case of compute cluster to compute cluster communication, a bandwidth of around 11 GB/s is obtained for eight pairs of communicating clusters. For the HEVC trace-based benchmarking, the best result for Full HD videos is around 1.44 frames processed per second. Consequently, the real-time requirement of 60 frames per second cannot be satisfied. The implemented benchmarks reveal that the memory accesses are very costly, especially for small transfer sizes. Therefore, a high number of such memory accesses might lead to performance bottlenecks.

## 10.11.2016-10:00am: "Speichervirtualisierung und -verwaltung für FPGA-SoCs" (Memory Virtualization and Management for FPGA-SoCs). Ilja Behnke.

When using FPGA-SoCs to solve complex problems heterogeneously, the developer has to tackle the issue of coherence in memory management. While the RAM is managed with virtual addresses by the operating system, the IP core has to be able to access it without involving the CPU. To avoid bottlenecks, this should happen without a significant increase in development cost or run-time cost. To access the memory from the FPGA, it is necessary to translate between virtual and physical addresses. This bachelor thesis deals with the design and implementation of an MMU IP core that enables the programmable logic to access the RAM. I also implement the necessary hardware and software infrastructure to use and evaluate the MMU core. In comparison to an alternative implementation without a separate MMU, the performance of the developed IP core is approximately 62.5% higher. At the same time, the performance is somewhat lower compared to a solution with dedicated memory and without virtual memory management. To the developer, the use of the MMU core is almost transparent.
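The address translation step can be sketched in software as a page-table lookup. This is a minimal sketch under stated assumptions: 4 KiB pages and a single-level table modeled as a dictionary are illustrative choices, and the names `translate`, `PageFault` and `page_table` are hypothetical; the actual design of the MMU core is not described in the abstract.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class PageFault(Exception):
    """Raised when a virtual page has no mapping (hypothetical error type)."""


def translate(vaddr, page_table):
    """Map a virtual address to a physical address via a single-level table.

    page_table maps virtual page numbers to physical frame numbers.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise PageFault(hex(vaddr))
    return page_table[vpn] * PAGE_SIZE + offset


# Hypothetical mapping: virtual page 0x10 lives in physical frame 0x2A.
page_table = {0x10: 0x2A}
print(hex(translate(0x10123, page_table)))
```

The page offset is carried over unchanged; only the page number is replaced, which is why a hardware unit can perform the lookup without involving the CPU once it has access to the table.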

## 03.11.2016-10:00am: "E²MC: Entropy Encoding based Memory Compression for GPUs". Sohan Lal.

Modern Graphics Processing Units (GPUs) provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Memory compression is a promising approach for improving memory bandwidth, which can translate into higher performance and energy efficiency. However, compression is not free and its challenges need to be addressed, otherwise the benefits of compression may be offset by its overhead. In this presentation, we present an entropy encoding based memory compression (E²MC) technique for GPUs, which is based on the well-known Huffman encoding. We study the feasibility of entropy encoding for GPUs and show that it achieves higher compression ratios than state-of-the-art GPU compression techniques. We address the key challenges of probability estimation, choosing an appropriate symbol length for encoding, and decompression with low latency. The average compression ratio of E²MC is 53% higher than the state of the art. This translates into an average speedup of 20% compared to uncompressed memory, which is 8% higher than the state of the art. Furthermore, we show that energy consumption and energy-delay-product are reduced by 13% and 27%, respectively.
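The underlying idea of Huffman encoding, giving frequent symbols shorter codes, can be sketched as below. This is a generic textbook construction, not the E²MC hardware design: the 8-bit symbol length, the sample block, and the omission of code-table storage are assumptions made for illustration.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in subtree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: d + 1 for s, d in left.items()}
        merged.update({s: d + 1 for s, d in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def compression_ratio(symbols, symbol_bits):
    """Uncompressed size divided by Huffman-coded size (code table ignored)."""
    lengths = huffman_code_lengths(symbols)
    compressed_bits = sum(lengths[s] for s in symbols)
    return len(symbols) * symbol_bits / compressed_bits


# Hypothetical 16-byte memory block dominated by zeros.
block = [0] * 12 + [1] * 2 + [7, 255]
print(huffman_code_lengths(block))
print(compression_ratio(block, symbol_bits=8))
```

The sketch estimates probabilities from the block itself; a hardware scheme has to obtain them differently and must also keep decoding latency low, which are the challenges the abstract names.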

## 20.10.2016-10:00am: "Efficient Real Time Face Recognition". Ahmed Elhossini.

Face recognition is an important task in many applications. In this work we investigated the use of Convolutional Neural Networks for face recognition. The new approach we present in this research will allow the network to learn new faces and update its structure accordingly. We also present the possible hardware architecture that can be used to implement this network.

## 13.10.2016-10:30am: "Design und Implementierung ausfallsicherer Firmwareupdates von hochverfügbaren Betriebsgeräten" (Design and Implementation of Fail-Safe Firmware Updates for Highly Available Operating Devices). Luise Moritz.

When a device gains new features, its firmware needs to be updated. But how can a device be updated if it needs to be highly available? The purpose of this thesis is to develop an update unit for a highly available operating device. With this update unit, the device can remain in its system and process the update data while keeping its full functionality. Special attention is paid to the detection and handling of errors during the update process. First, requirements for the update process and variants of the update unit are developed. These variants are discussed against the requirements of the update process. Based on these variants, the update process and the implementation of the update unit are described and evaluated.

## 13.10.2016-10:00am: "Analysis and Extension of the Intel SPMD Program Compiler". Daniel Schürmann.

The Intel SPMD Program Compiler (ispc) is a new research programming language to easily write vectorized code while maintaining high performance. However, due to a different programming model used in this language and missing language features, the obtained performance can fall short of expectations. The language's type system and control flow constructs are analyzed and a compiler extension was written to output additional information from the compiler's front-end. With regard to performance, two applications are optimized and benchmarked. Various issues of ispc, like the missing support for templates and missing standard optimizations, are identified and workarounds discussed.

## 06.10.2016-10:00am: "Sharing is caring: Making scholarly output freely available". Michaela Voigt (Open Access librarian at TU Berlin).

Readers want it, funders want it, libraries want it – free and open access to scholarly output. Studies have shown that open access publications have a citation advantage. There are different ways to implement OA: the golden road by publishing with an OA publisher, or the green road by self-archiving papers that were published with a traditional, closed access publisher. Self-archiving on a website is good, self-archiving on a repository is better! Why? We will discuss OA strategies in general and self-archiving strategies in particular, and present the university library’s self-archiving service and the tailored workflow for AES.

## 22.09.2016-11:00am: "Performance analysis of video playback for VR". Vincent Mühler.

As the popularity of Augmented and Virtual Reality (VR) rises quickly in modern industry, a lot of attention has been put into the development of VR applications such as omnidirectional VR video. Several internet platforms such as YouTube 360 have emerged that distribute VR video content and allow it to be streamed. However, this type of video content currently lacks quality and consumes a lot of bandwidth due to its high bitrate. VR video gives the consumer the freedom to navigate through an entire video scene, which is usually a panoramic or spherical scene. This poses the problem that the user's perspective view cannot itself be pre-encoded, as is done for conventional video with predefined sequences of images. Therefore, the common approach is to compress the entire scene, involving the transformation and back projection of video content, which causes the compression of redundant data as well as additional rendering overhead. Furthermore, it introduces the requirement of particular video players for playback of VR video.

This thesis discusses the main challenges of VR video and investigates the performance bottlenecks as well as solutions with regard to playback of VR video. The main bottleneck remains the decoding of the video stream, due to the high bitrate and the required resolution of VR video frames. It will be shown that a significant gain in performance can be achieved by using the HEVC codec instead of the leading H.264 standard for compression of VR video. Furthermore, an implementation of an HEVC decoder/encoder is suggested which grants better compression efficiency and savings in bitrate, as well as better performance in decoding high-quality VR video, compared to available open source solutions.

## 22.09.2016-10:00am: "Towards Skynet: Deep Learning on GPUs". Michael Andersch.

Deep Learning is a machine learning technique that’s revolutionizing industry after industry. The potential use cases are endless: From self-driving cars to faster drug development, from automatic image captioning to smart real-time language translation, Deep Learning is providing exciting opportunities wherever machines interact with the human world. To this day, one of the pillars behind this revolution has been the use of NVIDIA GPUs, enabling the first groundbreaking results as well as continuously driving performance up to support the advancement of the field. In this talk, we’ll take a look at what Deep Learning is about and discuss what it takes from a GPU software and architecture perspective to achieve great performance.

## 01.09.2016-10:00am: "SIMD Instruction Set Extension for a RISC-V Core". Pedram Zamirael.

In this project, a Single Instruction, Multiple Data (SIMD) instruction set extension for a RISC-V core was implemented.
The implemented instruction set extension is based on Intel's Streaming SIMD Extensions (SSE) and will provide a baseline implementation for further research on Application-Specific Instruction set Processor (ASIP)-based acceleration of video coding.
This presentation covers the procedure of implementing and testing a new instruction set using the Codasip Integrated Development Environment (IDE).

## 04.08.2016-11:00am: "OpenCL kernel analysis for HPC". Won Tae Joo.

Compiler optimizations rely on code features in order to perform advanced transformations. Static code feature extraction has been widely used, for instance, in the context of optimization frameworks based on machine learning. However, such approaches use approximate heuristics based on strong assumptions, which limit the accuracy of the modeling. This work proposes an automated feature extraction framework for OpenCL kernels based on cost relations. By exploiting the information known to the OpenCL runtime, the proposed framework builds a set of cost relation features, in which each feature is calculated as a polynomial of the input variables (known at runtime). The method exploits a characteristic of OpenCL: it is based on the C99 standard and does not allow recursive function calls. Results show that our approach is 190% more accurate than the state-of-the-art approaches based on purely static heuristics.
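The notion of a feature expressed as a polynomial of runtime inputs can be sketched as follows. The polynomial representation, the feature names and the example kernel shape are hypothetical and only illustrate the idea; they are not the framework's actual data structures.

```python
def evaluate(poly, env):
    """Evaluate a polynomial given as {monomial: coefficient}.

    A monomial is a tuple of (variable, power) pairs; () is the constant term.
    env maps variable names to the values known at runtime.
    """
    total = 0
    for monomial, coeff in poly.items():
        term = coeff
        for var, power in monomial:
            term *= env[var] ** power
        total += term
    return total


# Hypothetical features for a kernel with a doubly nested loop over n and m:
# two loads and one multiply per inner iteration, one branch per inner
# iteration plus one per outer iteration.
NM = (("n", 1), ("m", 1))
features = {
    "global_loads": {NM: 2},
    "float_mul": {NM: 1},
    "branches": {NM: 1, (("n", 1),): 1},
}

runtime_args = {"n": 4, "m": 8}  # values only available when the kernel runs
print({name: evaluate(poly, runtime_args) for name, poly in features.items()})
```

Because the kernel language rules out recursion, such counts can be derived from loop bounds alone, and the polynomial is simply evaluated once the arguments are known.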

## 04.08.2016-10:30am: "Optimizing PHP bytecode using type-inferred SSA form". Nikita Popov.

The PHP programming language is commonly used for the server-side implementation of web applications, powering everything from personal blogs to the world's largest websites. As such its performance is often critical to the response time, throughput and resource utilization of these applications.

The aim of this thesis is to reduce runtime interpreter overhead by applying classical data-flow optimizations to the PHP bytecode in static single assignment form. Type inference is used to enable the use of type-specialized instructions. Other optimizations include flow-sensitive constant propagation, dead code elimination and copy propagation. Additionally, inlining is used to increase the applicability of other optimizations.

The main challenge is to reconcile classical compiler optimizations, that have been developed in the context of statically typed and compiled languages, with a programming language that is not only dynamically and weakly typed, but also supports a plethora of other dynamic language features.
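Two of the classical optimizations named above, constant propagation and dead code elimination, can be sketched on a toy straight-line SSA program. The instruction format is invented for this illustration and is unrelated to real PHP bytecode; type inference, control flow and inlining are left out.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def optimize(program):
    """Fold constants, then drop instructions whose result is never used.

    program: list of (dest, op, args) in SSA form, so every dest is assigned
    exactly once. Operands are variable names (str) or integer literals.
    """
    consts, folded = {}, []
    for dest, op, args in program:
        # Replace operands that are known constants by their value.
        args = tuple(consts.get(a, a) if isinstance(a, str) else a for a in args)
        if op == "const":
            consts[dest] = args[0]
        elif op in OPS and all(isinstance(a, int) for a in args):
            consts[dest] = OPS[op](*args)  # computed at compile time
        else:
            folded.append((dest, op, args))

    # Backward pass: keep side effects and anything a kept instruction reads.
    live, kept = set(), []
    for dest, op, args in reversed(folded):
        if op == "return" or dest in live:
            kept.append((dest, op, args))
            live.update(a for a in args if isinstance(a, str))
    return kept[::-1]


program = [
    ("x1", "const", (2,)),
    ("x2", "const", (3,)),
    ("x3", "add", ("x1", "x2")),   # foldable to 5
    ("x4", "mul", ("x3", "n")),    # depends on the runtime input n
    ("x5", "add", ("n", "n")),     # result never used
    (None, "return", ("x4",)),
]
for instruction in optimize(program):
    print(instruction)
```

In a dynamically typed language, even this simple folding is only valid once the operand types are known, which is where the type inference mentioned above comes in.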

## 04.08.2016-10:00am: "HEVC acceleration using MPPA platform". Ahmed Zaki.

The HEVC decoder developed by the TU Berlin group is optimized for multiple architectures such as x86 and ARM, using techniques like SIMD to improve decoding speed.

The aim of this presentation is to show a proposed acceleration method for the HEVC decoder on a new architecture, the Kalray MPPA manycore processor. The MPPA setup consists of a host processor, an IO processor, and 256 general-purpose cores divided into 16 clusters. The main challenge is to split the existing HEVC decoder code (implemented as one multi-threaded process) into three processes to match the Kalray MPPA architecture (host processor, IO processor, and cluster general-purpose processors). In this presentation, I will present the initial simulation of the three-process implementation of the HEVC decoder using Linux on a 64-bit x86 processor. This is the first step towards a fully functional Kalray MPPA implementation.

## 28.07.2016-11:30am: "Bus Body Diagnostic Using an Onboard Prognostics System with Machine Learning". Andrew Berezovskyi.

In the past, maintenance was predominantly based on a fixed schedule and, more recently, on mileage. These approaches lead to excessive replacements and do not reduce the failure rate significantly. With condition-based maintenance, it is possible to achieve a better schedule and to replace parts only when their performance starts to drop.

In this thesis, the state-of-the-art approach to analysing single throw mechanical equipment is studied and applied. The angular displacement of the bus door leaves is used to train two logistic regression classifiers that determine whether a door opening or closing movement belongs to an operational or a broken door leaf. Models are trained and evaluated on data obtained from a test bus fitted with a broken door pillar bracket.

The results of this thesis are being incorporated into the next-generation bus body electronic system at Scania.

## 28.07.2016-10:30am: "Design and Implementation of a Real-time Tracking System for Free-Space Optical Communication". Konstantinos Vasiliou.

Free-space optical communication technology has great potential for high-speed point-to-point applications. The possibility of high data rates, license-free use, and small terminal size makes it a promising alternative to radio frequency and fiber-optic communication. However, atmospheric perturbations, building motion and pointing misalignment can cause Angle-of-Arrival (AoA) changes which reduce the link availability. The purpose of this thesis is to design and implement a real-time system that measures the AoA changes and performs high-speed steering, enabling direct coupling of an optical beam to a Single Mode Fiber (SMF). The hardware-software solution was tested in laboratory conditions as well as over a 1 km link, yielding good results. By the end of this thesis, actual tracking error suppression by a factor of more than 10 had been achieved.

## 28.07.2016-10:00am: "Deep learning: from theory via virality to applications in business". Rasmus Rothe.

Rasmus is the founder of Merantix, a company in the space of artificial intelligence. Earlier this year he launched howhot.io, which attracted more than 50m unique visitors to date. He studied Computer Science with a focus on (deep) machine learning at Oxford, Princeton and ETH Zurich.

Rasmus will present the research behind howhot.io, some fun facts, how it became viral and how he intends to leverage the technology behind it to solve real-world problems with his new company, Merantix.

## 21.07.2016-10:00am: "Automated Characterization of GPU Features Using OpenCL Microbenchmarks". Felix Goroncy.

Graphical Processing Units (GPUs) have gained a lot of importance during the past decades. Recent years have shown huge improvements in their performance. Development has been pushed towards more general purpose programmability, introducing the term General Purpose Graphical Processing Unit (GPGPU). With the emergence of high-level languages such as CUDA and OpenCL, GPUs have found use in a myriad of applications outside the graphical domain.

Using high-level languages simplifies software development and renders it intra- or even inter-vendor device independent. Writing highly optimized code, however, is very hardware dependent, and thus device dependent, and cannot benefit from the abstraction layers. Detailed knowledge about the inner workings of the GPU is vital, but the finer architectural aspects are often not documented.

We provide a framework of OpenCL microbenchmarks to uncover device-specific architectural details of GPUs from different hardware vendors. We show results for the inferred cache architecture and estimates of the access delays for the cache and memory of the PowerVR Rogue G6430 embedded GPU, the integrated GPU of the AMD E-350D APU and GeForce’s TITAN X. In addition, the SIMD width, branch divergence behaviour and number of compute units are revealed for the mentioned devices. More benchmarks are still in development.

## 14.07.2016-10:00am: "Investigation of real-time Ethernet based fieldbus protocols for power electronic devices". Sanjay Santhosh.

Precise real-time control and synchronisation are important in systems with several coupled power electronic devices. Precise control in such a distributed system requires a highly efficient fieldbus system. A new Ethernet-based industrial network is being proposed for point-to-point communication between the drives. The new system aims at reducing the complexity of the current system and is expected to remove the main control unit. It is also expected to handle a daisy-chained topology of controllers, allowing them to communicate with each other and with the master (with minimum latency) and to be synchronized with each other.

In this thesis, a new real-time Ethernet-based industrial network technology is proposed. The thesis classifies and compares the existing technologies. The newly proposed technology is compared to the existing systems and its performance is analyzed. The thesis also involves the design of the slave controller for the new technology.

## 07.07.2016-10:00am: "Autotuning Stencil Computations with Structural Ordinal Regression Learning". Biagio Cosenza.

Stencil computations expose a large and complex space of possible equivalent implementations. These computations often rely on autotuning techniques, based on iterative compilation or machine learning (ML), to achieve high performance. Iterative compilation autotuning is a challenging and time-consuming task, which may be unaffordable in many scenarios. Meanwhile, traditional machine learning autotuning approaches exploiting classification algorithms (like neural networks and support vector machines) face difficulties in capturing all features of large search spaces. This paper proposes a new way of automatically tuning stencil computations based on structural learning. By organizing the training data in a set of partially-sorted samples, i.e., rankings, the problem is formulated as a ranking prediction model, which translates to an ordinal regression problem. Our approach can be coupled within an iterative compilation method or used as a standalone autotuner. We demonstrate its potential by comparing it with state-of-the-art iterative compilation methods on a set of nine stencil codes.
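As a minimal sketch of how partially-sorted samples can be turned into ranking training data, the snippet below derives pairwise preferences from measured runtimes. The data layout, the function name and the configuration labels are hypothetical and not taken from the paper, and the structural learning step itself is omitted.

```python
from itertools import combinations


def pairwise_preferences(samples):
    """Turn per-stencil runtime measurements into (stencil, better, worse) pairs.

    samples maps a stencil name to a list of (configuration, runtime) tuples.
    Only configurations of the same stencil are compared, which is what makes
    the resulting ordering partial rather than total.
    """
    pairs = []
    for stencil, runs in samples.items():
        for (cfg_a, time_a), (cfg_b, time_b) in combinations(runs, 2):
            if time_a == time_b:
                continue  # ties carry no ordering information
            if time_a < time_b:
                pairs.append((stencil, cfg_a, cfg_b))
            else:
                pairs.append((stencil, cfg_b, cfg_a))
    return pairs


# Hypothetical measurements (seconds) for one stencil and three tile sizes.
example = {"blur": [("tile8", 2.0), ("tile16", 1.5), ("tile32", 3.0)]}
preferences = pairwise_preferences(example)
```

Each pair states only that one configuration ran faster than another, so a model trained on such pairs has to predict an ordering rather than an absolute runtime.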

## 30.06.2016-10:00am: "Improving performance scalability of a heterogeneous CPU+GPU HEVC decoder". Biao Wang.

The High Efficiency Video Coding (HEVC) standard provides state-of-the-art compression efficiency at the cost of increased computational complexity, which makes real-time decoding a challenge for high-definition, high-bitrate video sequences. An attempt has been made to offload computation-intensive decoding kernels to Graphics Processing Units (GPUs). The performance of the proposed CPU+GPU decoder, however, stops scaling when many CPU cores are employed. This presentation first analyzes the bottlenecks at which the performance of the CPU+GPU decoder stops scaling. Then several solutions for improving the performance scalability are discussed.

## 23.06.2016-10:30am: "Scaling HEVC Decoding to Many-cores". Julian Gog.

The High Efficiency Video Coding (HEVC) standard, published by the Joint Collaborative Team on Video Coding (JCT-VC), doubles the compression efficiency of video sequences in comparison to its predecessor, Advanced Video Coding (AVC). To improve the parallelization of the video standards, the AVC standard has undergone some improvements. For HEVC, the latest standard, TU Berlin developed new parallelization strategies such as Wavefront Parallel Processing (WPP) and Overlapped Wavefront (OWF). The latest of these methods, on which this thesis is based, is called Overlapped Multiframe Wavefront (OMWF). As its name implies, it can deal with more frames in flight than its predecessors.

In this thesis, a strategy to improve the scheduler is presented. Distributing the available threads among the running decoding processes is the key to making decoding more efficient and faster. The thesis presents a frame-size-based model that the scheduler can use to decide how many threads should work on a frame. With this model, the scheduler is able to predict the time needed to decode a frame, and the predicted time serves as the basis for deciding how many threads should work on that frame.
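To illustrate the dependency pattern behind the wavefront approaches mentioned above, the following sketch computes the earliest step at which each coding tree unit (CTU) could be decoded. It assumes that every CTU takes one time step, that enough threads are available, and that a CTU waits for its left neighbour and for the top-right neighbour in the row above. This is an idealized model for illustration, not the OMWF scheduler or the frame-size model of the thesis.

```python
def wavefront_start_steps(width_ctus, height_ctus):
    """Return a grid with the earliest decode step of every CTU."""
    steps = [[0] * width_ctus for _ in range(height_ctus)]
    for y in range(height_ctus):
        for x in range(width_ctus):
            # Wait for the left neighbour in the same row.
            after_left = steps[y][x - 1] + 1 if x > 0 else 0
            # Wait for the top-right neighbour; at the right picture edge
            # the CTU directly above is the last one in that row.
            if y > 0:
                x_above = min(x + 1, width_ctus - 1)
                after_above = steps[y - 1][x_above] + 1
            else:
                after_above = 0
            steps[y][x] = max(after_left, after_above)
    return steps


# A small hypothetical frame of 4 x 3 CTUs.
grid = wavefront_start_steps(4, 3)
total_steps = grid[-1][-1] + 1
```

In this model each row starts two steps after the row above, so the 12 CTUs of the example finish in 8 steps instead of 12. The limited parallelism at the start and end of a frame is what overlapping several frames is meant to hide.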

## 23.06.2016-10:00am: "Design and Implementation of a Temperature Based Fan Controller for Desktop Computers". Arne Salzwedel.

Most common fan controllers are integrated into the mainboard of a PC. They have limited configuration options, and the speed control is based only on CPU temperatures or measurements on the mainboard. In this presentation, a new fan controller concept that also uses GPU temperatures is shown. The system consists of hardware and software components. A graphical user interface offers flexible configuration options.
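One way such a controller could combine CPU and GPU readings is sketched below: the hotter of the two temperatures is mapped to a fan duty cycle through a piecewise-linear curve. The curve points, the function name and the choice of the maximum as the combining rule are assumptions made for illustration and are not taken from the presented system.

```python
# Hypothetical fan curve: (temperature in degrees Celsius, duty cycle in percent).
DEFAULT_CURVE = ((30, 20), (50, 40), (70, 80), (85, 100))


def fan_duty(cpu_temp_c, gpu_temp_c, curve=DEFAULT_CURVE):
    """Map the hotter of two temperatures to a duty cycle in percent."""
    temp = max(cpu_temp_c, gpu_temp_c)
    if temp <= curve[0][0]:
        return float(curve[0][1])      # below the curve: minimum duty
    if temp >= curve[-1][0]:
        return float(curve[-1][1])     # above the curve: full speed
    # Linear interpolation between the two surrounding curve points.
    for (t0, d0), (t1, d1) in zip(curve, curve[1:]):
        if t0 <= temp <= t1:
            return d0 + (d1 - d0) * (temp - t0) / (t1 - t0)


duty_gaming = fan_duty(cpu_temp_c=40, gpu_temp_c=60)  # the GPU dominates here
```

With a CPU-only controller the same situation would be treated as a 40 degree reading, which is the limitation the abstract points out.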

## 16.06.2016-10:00am: "Embedded Load Balancing". Robert Hering.

The performance of modern embedded systems is becoming increasingly multivariate and complex. Multiple performance values have to be tuned through multiple parameters, while complex dependencies emerge due to limited system resources and changing environmental conditions. As manual optimization based on experience with the platform thus becomes impractical, a demand for automatic performance tuning arises. In this master thesis, a generic optimization approach is developed using machine learning, with a focus on higher-performance embedded systems. The current progress is presented together with promising simulation results using both test and real data.

## 09.06.2016-10:00am: "Memory bandwidth evaluation of the Kalray MPPA". Beatrice Tirziu.

The Kalray MPPA manycore processor is a 256-core general purpose processor that integrates 256 user cores and 32 system cores on a chip. The cores are organized in 16 compute clusters, each with 2 MB of shared memory and 4 peripheral I/O quad-cores. While the MPPA promises high computational performance, for many memory-demanding applications like the HEVC decoder, the memory bandwidth can limit the overall performance. A performance evaluation of the MPPA memory subsystem is therefore necessary to predict the upper-bound performance of an actual implementation. We have thus implemented micro-benchmarks for the evaluation of the memory accesses. This presentation will feature the results for the different communication methods: host DDR3 to MPPA DDR3, MPPA DDR3 to the compute clusters' shared memory, and between the compute clusters' shared memories. The tests include communication between different numbers of clusters and between the clusters and the main memory, for multiple parallel transfers, different numbers of threads and various data sizes. The next step is to perform trace-based HEVC benchmarking. The motion compensation stage of the HEVC decoder requires a high memory bandwidth, and it is thus of great importance to test whether the memory accesses represent a bottleneck.

## 02.06.2016-10:00am: "Evaluation of performance of eGPU using stereo vision algorithm". Raam Mohan.

Stereo vision is a method of extracting 3D information about a scene using images taken from different viewpoints. There are many approaches to disparity map (DM) creation in a stereo vision system. Embedded GPUs (eGPUs) are low-power GPUs integrated on the same chip as the host CPU, and stereo vision is one of the applications well suited to them. In this master thesis, the effect of parameter variations in a stereo vision algorithm is studied. The algorithm is implemented in software, and several optimizations are applied to improve its execution time. The optimizations yielded a 31x to 35x improvement in execution time, depending on the platform. The algorithm is then implemented on the Nema eGPU from ThinkSilicon Ltd., and its functional performance and execution times are studied. Comments on the eGPU architecture are given based on the recorded observations, and future work in the field is suggested.

## 26.05.2016-10:30am: "Evaluating Nexus++ as a Tightly Coupled Coprocessor for ARM Multi-Core Processors". Haopeng Han.

Multi-core processors are becoming dominant in the computer industry. The number of cores inside a chip is still expected to increase in the next few years. This raises the problem of programmability: how can multi-core processors be utilized efficiently? Task-based programming models such as OmpSs are promising for solving this problem, but the runtime systems within these programming models can become a bottleneck that limits performance. Nexus++ was originally designed as a loosely coupled co-processor that communicates with an x86 multi-core processor via PCIe links to boost the performance of the runtime system within OmpSs. This loosely coupled setup could be a limiting factor for system performance. In this thesis work, we prototype Nexus++ on a state-of-the-art FPGA SoC, the Xilinx Zynq-7000, and evaluate it as a tightly coupled coprocessor for ARM multi-core processors. To this end, we first ported the VSs/VMS runtime library from x86-64 to ARM. Then we implemented Nexus++ in the Zynq SoC with two kinds of AXI protocols. Several new components, including a Linux device driver, communication protocols, NexusIO, etc., were developed. We ran several OmpSs benchmark programs to evaluate Nexus++'s performance, and tested different configurations of our implementations to get a thorough evaluation. Because of hardware limitations, we can only obtain test results for a maximum of two cores on the ZC706 development board. A simulation engine was therefore designed for evaluating Nexus++ with an arbitrary number of cores. Evaluation results indicate that, due to memory contention, the VSs runtime system with hardware acceleration performs similarly to the pure software VSs runtime system. The simulation engine designed for Nexus++ works as expected, and simulation results show that Nexus++ scales well.

## 26.05.2016-10:00am: "A Trace-based Work Flow for Evaluating the Application-specific Memory Bandwidth of FPGA-SoCs". Matthias Göbel.

FPGA-SoCs such as Xilinx’s Zynq-7000 and Altera’s Cyclone V SoC provide a great platform for HW/SW-Codesigns. These devices combine a powerful embedded processor with programmable logic similar to that found in FPGAs. Thus, by using a HW/SW-Codesign approach and partitioning an application into software and hardware parts, an implementation that combines the best of both worlds can be reached: the flexibility and ease of use of software and the high throughput and low energy consumption of hardware solutions. Due to the hard-core processor, the overall performance is higher than that of a system using a soft-core processor in an FPGA.

For the communication between both parts, FPGA-SoCs provide various ports in the logic component that offer access to the DDR memory used by the processor. While high throughput inside an FPGA can be easily achieved for many applications, these ports, and therefore the memory bandwidth, limit the overall performance. This is especially troublesome as it is often challenging to define the corresponding requirements in advance. Furthermore, the memory bandwidth that can actually be achieved depends on many parameters, e.g. the width of memory accesses.

In this presentation, I will present a work flow that estimates the required memory bandwidth by analyzing the memory trace of an equivalent pure software solution. The flow also includes mechanisms to simulate the behavior of a HW implementation by mimicking its memory accesses and to measure the achieved bandwidth. Thus, it can be determined whether the effort of implementing a HW/SW-Codesign can be justified.
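The core of a trace-based estimate can be sketched as follows: given a memory trace of a software run, the average required bandwidth is the number of bytes moved divided by the time span of the trace. The trace format (timestamp in seconds, access size in bytes) and the function name are assumptions for illustration and do not describe the actual tool flow.

```python
def required_bandwidth(trace):
    """Average bandwidth in bytes per second over a memory trace.

    trace is a list of (timestamp_seconds, num_bytes) tuples, one per access.
    """
    if len(trace) < 2:
        raise ValueError("need at least two accesses to span a time interval")
    timestamps = [t for t, _ in trace]
    duration = max(timestamps) - min(timestamps)
    if duration <= 0:
        raise ValueError("trace must span a positive time interval")
    total_bytes = sum(size for _, size in trace)
    return total_bytes / duration


# Hypothetical three-access trace: 256 bytes moved within one second.
demo_trace = [(0.0, 64), (0.5, 64), (1.0, 128)]
demo_bandwidth = required_bandwidth(demo_trace)
```

Comparing such a requirement with the bandwidth measured through the logic-side memory ports indicates whether a hardware implementation would be memory bound.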

## 19.05.2016-10:00am: "Optimizing the HEVC decoding using AVX512 SIMD extensions". XU CAO.

SIMD instructions have been commonly used to accelerate video codecs. The recently introduced HEVC codec, like its predecessors, is based on the hybrid video codec principle and is therefore also well suited to SIMD acceleration. The new Intel SIMD extension, AVX512, processes more data elements per instruction than previous SIMD extensions such as SSE, AVX and AVX2. Increasing the SIMD width, however, does not equate to a linear increase in performance, because of application parallelism limitations and/or other architectural limitations. Therefore, it is important to analyse the potential performance difference between AVX512 and other SIMD extensions such as AVX2 and SSE for HEVC decoding. As an initial result, we will present a profile of the HEVC decoder to identify the potential total improvement and the functions that could benefit the most.

## 12.05.2016-10:30am: "Syntax Element Partitioning for High-throughput HEVC CABAC Decoding". Philipp Habermann.

Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coder in the most recent JCT-VC video coding standard HEVC/H.265. Due to its sequential algorithm, CABAC is one of the most critical throughput bottlenecks in video decoding, especially for high bitrate videos. Unlike other components of the HEVC decoder, there is no data level parallelism that can be exploited. High-level parallel CABAC decoding is supported in HEVC, e.g. by utilizing Wavefront Parallel Processing or Tiles. However, the implementation affects the coding efficiency and requires a duplication of the corresponding decoding hardware.

In this work we propose a tiny modification of the HEVC bitstream format that enables the parallel decoding of multiple bitstream sections with negligible impact on coding efficiency. Furthermore, the hardware cost is not expected to grow linearly with the number of parallel bitstream partition decoders as only parts of the decoder need to be duplicated. Simulation results show potential speed-ups up to 4.4x with eight parallel bitstream partitions, while adding less than 1% overhead to the bitstream. Even higher speed-ups can be expected due to the potential clustering and better customization of the decoding hardware.

All in all, the proposed bitstream format modification is a promising step towards very high throughput CABAC decoding which might be adopted in future video coding standards.

## 12.05.2016-10:00am: "Design and Implementation of an Improved Power Measurement Testbed for LPGPU2". Jan Lucas.

In this talk I will describe the design choices and the implementation of a new and improved power measurement testbed for the LPGPU2 project. While the first LPGPU project already had a power measurement testbed, it was designed for discrete GPUs, whereas LPGPU2 employs embedded platforms based on SoCs such as the Nexus Player. The old testbed was a prototype that was not suitable for manufacturing. It also used a commercial DAQ with buggy closed-source drivers incompatible with many versions of Linux and could not sample voltage and current at exactly the same time. The new power measurement device was designed by TUB, is suitable for manufacturing, sports a higher-resolution input (24-bit instead of 16-bit) and samples all inputs at the same time. The design of the circuit and the choice of parts will be described. Then I will explain how 6 Mbit/s of sample data can be transmitted to the host using less than 20 KB of RAM. Next, the software running on the host PC is described and the USB communication between host and device is detailed. Finally, the new power measurement device will be presented and first measurement results will be shown.

## 28.04.2016-10:00am: "Web-Based HEVC/H.265 Decoding Using JavaScript". Thomas Siedel.

HEVC provides much better compression efficiency than its predecessor H.264. This would be especially beneficial in the web environment where bandwidth is always critical. Currently support for HEVC in browsers is very limited, both with HTML5 and plugins such as Adobe Flash. This thesis describes a new approach where the decoding is done entirely in JavaScript directly in the browser. For this purpose, a modified version of an optimized HEVC decoder written in C++ is ported to JavaScript with the help of the Emscripten compiler. The average performance slowdown compared to the original code without optimizations is 1.6x in Firefox and 2.4x in Chrome for 1080p content. Performance optimizations using an experimental SIMD library for JavaScript (SIMD.js) and experimental multithreading are also implemented and evaluated. Using an experimental version of Firefox, which supports JavaScript features that are under development, it is possible to improve the performance by 1.6x using SIMD and an additional 3.9x using multithreading with 4 threads. The findings demonstrate that 1080p real-time decoding is possible on a standard desktop PC when experimental JavaScript features are used.

## 21.04.2016-10:00am: "Reduction of Radar Processing Latency". Tamilselvan Shanmugam.

Technology makes everything simpler, faster and smarter. In an attempt to evaluate future Radar processor concepts, an analysis is carried out based on a mock-up of a Radar processing algorithm. An ARM platform is chosen with size, weight and power in mind. Several ARM Cortex A9 processors are used to execute the Radar processing algorithm. An analysis done by Airbus DS investigates various means of scheduling the Radar application onto the ARM processors. As the software is safety critical, determinism in terms of worst-case execution time is a crucial factor. However, the analysis concluded that the worst-case execution time on the ARM processors is not in the acceptable range.

This thesis studies the bottlenecks imposed by the application, the data dependencies between the processing chains and possible improvements. The implementation focuses on the parallelism in the Radar application and schedules the parallel parts in an optimal way to achieve the best results. As part of the investigation, peak processor utilization, peak memory utilization, peak bandwidth utilization, worst-case execution time, bottlenecks and future scope are discussed in detail. An analysis is presented based on the new implementation schemes.

## 17.03.2016-10:00am: "Optimizing PHP bytecode using type-inferred SSA form". Nikita Popov.

PHP is a dynamically typed programming language commonly used for the implementation of web applications. As such, its performance is often critical to the response time and throughput of these applications. This thesis aims to reduce runtime overhead by improving the quality of the generated PHP bytecode, using data-flow optimizations that act on static single assignment (SSA) form annotated with inferred value range and type information. These optimizations include flow-sensitive constant propagation, dead code elimination, elimination of redundant computations through global value numbering and type specialization, as well as some optimizations specific to the PHP virtual machine. Inlining is used to increase the applicability of other optimizations. A primary challenge is to reconcile these optimizations with PHP's highly dynamic nature, which makes it hard to statically prove preconditions necessary for optimization, especially when considering real application code rather than artificial benchmarks.
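To make two of the optimizations named above concrete, the sketch below applies constant propagation and dead code elimination to a toy straight-line program in SSA form. The instruction format is invented for this example and is unrelated to the real PHP bytecode. Control flow, types and the flow-sensitive part of the analysis are deliberately left out.

```python
import operator

# Toy SSA program: (destination, opcode, operands); each name is defined once.
PROGRAM = [
    ("a", "const", (2,)),
    ("b", "const", (3,)),
    ("c", "add", ("a", "b")),
    ("d", "mul", ("c", "x")),   # "x" is an unknown runtime input
    ("e", "add", ("a", "a")),   # result is never used
    (None, "ret", ("d",)),
]

FOLDABLE = {"add": operator.add, "mul": operator.mul}


def propagate_constants(program):
    """Replace operands by known constants and fold fully constant operations."""
    known = {}
    result = []
    for dest, op, args in program:
        values = tuple(known.get(arg, arg) for arg in args)
        if op == "const":
            known[dest] = args[0]
            result.append((dest, op, args))
        elif op in FOLDABLE and all(isinstance(v, int) for v in values):
            known[dest] = FOLDABLE[op](*values)
            result.append((dest, "const", (known[dest],)))
        else:
            result.append((dest, op, values))
    return result


def eliminate_dead_code(program):
    """Keep only instructions whose results are (transitively) needed by 'ret'."""
    live = set()
    kept = []
    for dest, op, args in reversed(program):
        if op == "ret" or dest in live:
            kept.append((dest, op, args))
            live.update(arg for arg in args if isinstance(arg, str))
    kept.reverse()
    return kept


optimized = eliminate_dead_code(propagate_constants(PROGRAM))
```

After both passes only the multiplication by the folded constant 5 and the return remain. The single-definition property of SSA is what allows the `known` table to be a plain dictionary.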

## 10.03.2016-10:30am: "Exploration of the Interoperability of OpenCL and OpenGL". Michael F. W. Kürbis.

Some applications, such as video playback engines, have the opportunity to utilize both the parallel computation API OpenCL and the graphics API OpenGL. In such a case, the interoperability mechanism between OpenCL and OpenGL is useful as it enables OpenGL to render a frame decoded with OpenCL on the GPU directly. In this study, the interoperability between OpenCL and OpenGL is explored across different platforms, with a focus on the exchange of data buffers between the APIs. A library to support the usage of this interoperability is presented, together with benchmark results.

## 10.03.2016-10:00am: "Analysis and Extension of the Intel SPMD Program Compiler". Daniel Schürmann.

The Intel SPMD Program Compiler (ispc) is a language to utilize SIMD hardware extensions. It aims for high efficiency and performance while keeping the simplicity of scalar code. ispc uses the Single Program Multiple Data (SPMD) programming model, where multiple instances of the program are mapped to individual SIMD-lanes of the processor, while the control flow of diverging program instances is handled by masking operations.

In this thesis, a new compiler mode, "ispc explained", is written to gain better insight into the compiler as well as into the actual code. This compiler mode annotates the provided source code with types, the overload resolution of function calls, and gather/scatter operations. That way it is possible to find problems and difficulties in the compiler itself and in the code of vectorized programs.
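The masking idea behind the SPMD model can be illustrated with a small scalar emulation: every list index stands for one program instance (SIMD lane), both sides of a branch are executed in turn, and a mask decides which lanes commit their results. This is a conceptual sketch in Python, not ispc code and not the code that ispc generates.

```python
def spmd_abs_diff(a, b):
    """Emulate 'if (a > b) r = a - b; else r = b - a;' across all lanes."""
    lanes = len(a)
    # The condition is evaluated for all program instances at once.
    mask = [a[i] > b[i] for i in range(lanes)]
    result = [0] * lanes
    # 'then' side: only lanes with an active mask commit their result.
    for i in range(lanes):
        if mask[i]:
            result[i] = a[i] - b[i]
    # 'else' side: runs with the inverted mask.
    for i in range(lanes):
        if not mask[i]:
            result[i] = b[i] - a[i]
    return result


lanes_out = spmd_abs_diff([5, 1, 7, 2], [3, 4, 7, 9])
```

Because both sides of the branch are executed whenever the lanes disagree, divergent control flow costs the sum of both paths.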

## 03.03.2016-10:00am: "An Evaluation of Current SIMD Programming Models for C++". Angela Pohl.

SIMD extensions were added to microprocessors in the mid '90s to speed up data-parallel code by vectorization. Unfortunately, the SIMD programming model has barely evolved and the most efficient utilization is still obtained with elaborate intrinsics coding. In this work, we evaluated current programming models for the C++ language, which claim to simplify SIMD programming while maintaining high performance. The proposals were assessed by implementing two kernels: one standard floating-point benchmark and one real-world integer-based application, both highly data parallel. In this presentation, we will discuss the different programming approaches and highlight the achieved performance gains, but also share the models' current drawbacks.

## 25.02.2016-10:00am: "A Scheme to support two dimensional memory access in Image and Video Processing Applications". Tareq Alawneh.

Processor speed has increased much faster than memory speed. Therefore, the performance improvements expected from increasing the processor speed are limited by the memory latency. Most algorithms that work on image data exhibit a distinctive data access pattern, called two-dimensional access (2D access). If this special data access pattern is exploited properly, it can yield a significant improvement in application performance.

The standard cache and memory organizations do not reflect the way that data are accessed by the image processing applications. As a result, they achieve poor performance when used for image applications. In this work, we propose a memory organization to exploit the 2D data access pattern in image processing applications.
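The mismatch between linear memory layouts and 2D access can be made tangible by counting how many cache lines a small pixel block touches. The sketch compares a conventional row-major layout with a tiled layout. The tiled layout is used here only as a well-known illustration and is not claimed to be the organization proposed in this work. Image width, tile size, line size and one byte per pixel are assumed values.

```python
def row_major(width):
    """Address of pixel (x, y) when image rows are stored one after another."""
    return lambda x, y: y * width + x


def tiled(width, tile=8):
    """Address of pixel (x, y) when the image is stored tile by tile."""
    tiles_per_row = width // tile

    def address(x, y):
        tile_index = (y // tile) * tiles_per_row + (x // tile)
        return tile_index * tile * tile + (y % tile) * tile + (x % tile)

    return address


def lines_touched(x0, y0, block_w, block_h, address, line_size=64):
    """Number of distinct cache lines accessed by a block_w x block_h block."""
    return len({
        address(x, y) // line_size
        for y in range(y0, y0 + block_h)
        for x in range(x0, x0 + block_w)
    })


# An aligned 8x8 block in a 1024-pixel-wide image, one byte per pixel.
linear_lines = lines_touched(16, 8, 8, 8, row_major(1024))
tiled_lines = lines_touched(16, 8, 8, 8, tiled(1024))
```

For this aligned block, the row-major layout touches one cache line per image row, while the tiled layout keeps the whole block in a single line.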

## 18.02.2016-10:30am: "Data Transformation Trajectories in Embedded Systems". Nawabul Haque.

A method for live trajectory estimation of UE (User Equipment) in an LTE network is proposed in this master thesis work. AoA (Angle of Arrival) and TA (Timing Advance) values are used to estimate the UE position. The calculated UE positions are filtered using a Particle Filter (PF) to estimate a trajectory. The trajectory data is uploaded to a server, from which it is fetched and projected on the trajectory projection systems. The software developed as part of the thesis runs on network equipment located at the RBS (Radio Base Station). A comparison between trajectories obtained without filtering and trajectories estimated with the filter is also presented. The error characteristics of the estimated trajectory are presented and discussed, and some future work to improve the trajectory estimation is proposed.
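A raw, unfiltered position estimate from one AoA/TA pair can be sketched as a polar-to-Cartesian conversion around the base station. The sketch assumes that one LTE timing advance step equals 16 basic time units of 1/(15000 * 2048) s and covers the round trip, and that the angle is measured counter-clockwise from the x axis. The function name and coordinate convention are assumptions, and the particle filter stage is not shown.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0
LTE_BASIC_TIME_UNIT_S = 1.0 / (15000 * 2048)
# One TA step is 16 basic time units of round-trip delay: roughly 78 m one way.
TA_STEP_M = SPEED_OF_LIGHT_M_S * 16 * LTE_BASIC_TIME_UNIT_S / 2


def ue_position(rbs_xy, aoa_deg, timing_advance):
    """Estimate the UE position from angle of arrival and timing advance."""
    distance_m = timing_advance * TA_STEP_M
    angle_rad = math.radians(aoa_deg)
    return (
        rbs_xy[0] + distance_m * math.cos(angle_rad),
        rbs_xy[1] + distance_m * math.sin(angle_rad),
    )


# Hypothetical measurement: base station at the origin, AoA 90 degrees, TA 10.
estimate = ue_position((0.0, 0.0), 90.0, 10)
```

Since the distance is quantized in steps of about 78 m, individual estimates are coarse, which helps explain why filtering a sequence of them is useful.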

## 18.02.2016-10:00am: "OpenCL Kernel Cost Analysis for HPC". Won-Tae Joo.

OpenCL is a framework for parallel computing which provides a programming model that can be used to program CPUs, GPUs and other devices from different vendors. It uses just-in-time compilation to enable the distribution of device-independent code. The device-independent representation for OpenCL is called SPIR (Standard Portable Intermediate Representation). Recently, the Khronos Group introduced a new version of SPIR, SPIR-V, as a new standard. SPIR-V is a high-level intermediate language which is abstract enough to guarantee device independence while retaining information that can be used to maximize performance on target devices.

Compiler transformations always require in-depth analysis of the intermediate representation; a classic example is the extraction of static code features in order to train a machine-learning-based optimizer. The high level of abstraction and the information contained in SPIR-V allow a high-level analysis at compile time. My thesis work addresses a portable method based on SPIR-V to analyze and extract code features from an OpenCL kernel. In contrast to the state of the art, whose approaches are based on simple approximation heuristics, my work focuses on cost relations: static code features of programs as a function of their input data size. Such a methodology will enable combined compiler and runtime optimizers with higher accuracy than the traditional static-only approach.

## 11.02.2016-10:00am: "Design and Implementation of an FPGA-based Optical QAM Receiver for Real-Time Applications". Björn Böttcher.

Digital signal processing (DSP) has recently gained high importance in modern optical communication systems. DSP development is therefore a key for the design and development of optical transmission systems. To test new algorithms in real-time in laboratory experiments, a flexible infrastructure based on FPGA is highly desirable.

This presentation gives an overview of the basics of a digital coherent optical transmission system and discusses options for the realization of the FPGA platform. After summarizing the aim and motivation of this thesis, the presentation describes the development steps of the design and implementation phase. It then discusses the results of a final system experiment and concludes with a summary of the whole thesis.

## 04.02.2016-10:30am: "Preliminary investigation of wireless communication between medical devices to increase the device usability within minimally invasive procedures". Martin Schürer.

Minimally invasive surgery, also known as keyhole surgery, is an alternative to conventional surgeries. Within a laparoscopy the abdominal cavity is expanded by an insufflator insufflating CO2 into the abdomen. The insufflator is located on the endoscopic tower together with other devices that are required for MIS procedures, such as a camera, lighting, a pump and monitors. The tower stands outside the sterile environment. To change insufflation settings, for instance, to start or stop the insufflation, a common approach is that the surgeon commands a non-sterile surgery assistant to apply changes. This procedure can be inappropriate in terms of usability, because it causes a time delay and the assistant may block the surgeon’s view to the monitor.

The goal of this thesis is to develop a technical concept that investigates the impact of wireless communication between medical devices. The concept follows a generalization strategy so that it can be implemented on many other devices with only minor device-specific changes. Safety criteria and energy consumption are also taken into account. To evaluate the concept, a prototype has been implemented that meets all the requirements and that is expected to work properly. Due to the security concept, the signal range is limited to the size of the operating room. Therefore, Bluetooth Low Energy (BLE) was selected as the appropriate wireless technology, because it has a limited signal range of 10 meters. BLE also has the benefit that it consumes less energy than other wireless technologies like ZigBee or Wi-Fi.

The prototype was tested to determine the application throughput. The average system response latency is 217.09 ms and the average application data throughput is 18.46 bytes/s for a data payload of 4 bytes. When using the maximum data payload of 31 bytes, the throughput can reach 143.12 bytes/s per communication channel. In theory, the maximum application data throughput is 430 bytes/s. Compared to other wireless technologies, this connection speed seems exceptionally slow, but BLE was not designed for high-speed data transfer. Moreover, the system response time is composed of two transmission times and the calculation times of the client and server. The measured latency is sufficient for a quick response to user input. According to theoretical calculations, the battery lasts up to 5 hours longer when using BLE instead of WLAN.
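The reported throughput figures are roughly what one would expect if a single payload is delivered per request/response cycle. The short calculation below checks this reading. The one-payload-per-cycle relation is an inference from the numbers in the abstract, not a statement from the thesis, and it reproduces the reported values only approximately.

```python
def throughput_bytes_per_s(payload_bytes, response_latency_s):
    """Application throughput if one payload is delivered per response cycle."""
    return payload_bytes / response_latency_s


AVERAGE_LATENCY_S = 0.21709  # 217.09 ms, as reported above

small_payload = throughput_bytes_per_s(4, AVERAGE_LATENCY_S)
max_payload = throughput_bytes_per_s(31, AVERAGE_LATENCY_S)
```

Under this assumption the estimates come out close to the reported 18.46 bytes/s and 143.12 bytes/s, which suggests that the throughput is bounded by the response latency rather than by the radio data rate.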

This thesis is a contribution to the current trends that will ensure a higher incidence of integrated operating rooms in the future. Based on the generalization strategy, the concept can also be implemented outside the medical domain, for instance, in the home automation field.

## February 4, 2016- 10:00am: "Entwurf und Implementierung einer temperaturabhängigen Lüftersteuerung für Arbeitsplatzrechner" (Design and Implementation of a Temperature-Dependent Fan Controller for Desktop Workstations). Arne Salzwede.

The fan controllers established on the market are mostly integrated into the mainboard and regulate only on the basis of the CPU temperature. This talk presents a concept for a controller that also uses hardware sensors from GPUs and is flexibly configurable. The system consists of a hardware module and software components. The current progress of the implementation is shown and the problems that currently exist are discussed.

## January 28, 2016- 10:00am: "Analysis of computer Vision Algorithms for high performance, Real-time Embedded implementations". Kawa Haji Mahmoud.

This presentation demonstrates the latest activities towards accurate real-time computer vision based systems.
The content of this presentation is twofold: First, a literature survey of feature extraction and classification algorithms for face detection and recognition applications will be briefly presented. Second, the current work towards enhancing the performance of the hardware architecture for the Viola & Jones algorithm will be presented. In this work, the processing of the Viola & Jones algorithm is distributed over multiple FPGA boards.

## 21.01.2016- 10:00am: "Towards Efficient Real Time Face Recognition". Ahmed Elhossini.

Facial recognition is an important task in many applications such as security applications, robotics, smart mobility and visual aid devices. This presentation gives a summary of the research activities performed toward the implementation of an efficient real-time facial recognition system. For the targeted applications, the process is usually divided into two stages:

1. face detection: locating the faces within the image, and
2. face recognition: matching the detected faces against an existing database of faces.

These two steps have common tasks. Both require extracting features from the image and both require a classification mechanism to find the matching object in the database. The work done in each of these tasks will be illustrated in this presentation.

## January 14, 2016- 10:00 am: "Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency". Jan Lucas.

Temporal SIMT (TSIMT) has been suggested as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been briefly mentioned before, it was not evaluated. We present a complete design and evaluation of TSIMT GPUs, along with the inclusion of scalarization and a combination of temporal and spatial SIMT, named Spatiotemporal SIMT (STSIMT). Simulations show that TSIMT alone results in a performance reduction, but a combination of scalarization and STSIMT yields a mean performance enhancement of 19.6% and improves the energy-delay product by 26.2% compared to SIMT.

## January 7, 2016- 10:00 am: "Konzeption und prototypische Realisierung einer seriellen Ansteuerung von E-Paper-Displays in einem Display-Grid" (Concept and Prototype Implementation of Serial Control for E-Paper Displays in a Display Grid). Igor Linkov.

The motivation for this development was the observation that many areas of trade and industry could use a large-format display that is lightweight, high-resolution and very energy-efficient. E-paper (EP) displays were chosen as the basis for this development. Because these displays currently reach a maximum diagonal of 13" (Sony), a solution is needed to combine several displays into a grid. The displays in the grid should be individually addressable and driven by a single standard controller, i.e., by serial control (part of the task).
Technically, it should be possible to route the data and synchronization signals from one controller to any selected display. The challenge was that the signals transmitted in parallel had different frequencies and no common clock. The addressing had to be designed in line with this concept. In addition, the power supply of the displays had to be solved: several voltages of different polarity that the display requires should be switched on at the selected display when needed and switched off after the update process.

Several solution concepts were analyzed during the work. The selection criteria were simplicity and robustness. The line configuration was chosen. Based on this concept, several approaches were developed, two of which were examined in more detail (analytically and experimentally). The approach based on asynchronous sampling turned out to be a dead end, whereas the approach with an FPGA-based demultiplexer proved to work. During the experiments, reliable switching was achieved and the latencies were measured. Some limitations were identified that prevent this solution from scaling.

The addressing task was solved using an AND gate and micro switches. For the power supply, a switching solution with high-side and low-side switches was developed that could be activated by the addressing circuit. The circuit was tested experimentally and its functionality was confirmed.

## December 17, 2015- 10:00am: "Bachelor/Master Thesis Grading Schemes". Angela Pohl.

With our current grading scheme for student theses, only the final written document is taken into account. This is not ideal, as most of a student's work is neglected and publishing of results can be difficult for external theses. As a consequence, we worked out a new grading scheme over the summer. It considers implementation work and presentations (midterm and final) besides the written thesis, and puts a different emphasis on each factor for bachelor and master level work. I will present the new proposal and would like to have a discussion with the group to agree on a new grading scheme.

## December 10, 2015- 10:30am: "Parallel HEVC Decoder on CPU+GPU Heterogeneous Platforms". Diego Felix de Souza.

In the last few years, a highly optimized CPU-based SIMD HEVC decoder has been developed by the TU Berlin group. Pursuing other research directions, the INESC-ID group has already parallelized most of the HEVC decoder modules for execution on state-of-the-art GPU devices. Accordingly, the primary goal of this collaboration is to devise a joint research initiative aimed at realizing a high-performance CPU+GPU HEVC decoder, where the CPU will be responsible for the most irregular and sequential modules, i.e., the entropy decoder, and the remaining modules will be executed in parallel by the GPU device.

## December 10, 2015- 10:00am: "Design and Implementation of a Real-Time Tracking System for Free-Space Optical Communication". Konstantinos Vasiliou.

Free-space Optical Communication (FSO) is an optical communication technology that uses light propagating in free space to wirelessly transmit data. This technology offers the bandwidth of fiber optic communication with reduced cost and high flexibility. One of the biggest problems of FSO communication is turbulence-induced disturbance that changes the angle of arrival of the beam at the receiving terminal. Disturbances such as beam wander and scintillation make it difficult to couple the received beam into an optical fiber that has a radius of 10 micrometers over a distance of several km.

For this reason adaptive optics are used in each terminal. This technique involves an analogue detector (QD) and a fast steering mirror. The detector reads the direction of the incoming beam and the controller corrects the angle of arrival of the beam by steering the fast steering mirror towards the correct direction. The purpose of the tracking system is to maintain the incoming laser beam in the center of the detector and subsequently in the center of the single-mode fiber (SMF).

The purpose of this thesis is to implement the tracking algorithms on a National Instruments cRIO-9063 controller, which combines a dual-core ARM Cortex-A9 processor running at 667 MHz with an Artix-7 FPGA. The controller must perform repeated measurements and corrections at a rate of 10 kHz. For the evaluation of the system, the stability and bandwidth of the tracking system and the tracking error suppression are considered. The performance of the controller will be tested in laboratory conditions and over a link of 1 km.

## December 3, 2015- 10:00am: "Novel Indexing Structures for Storage Class Memory". Johan Lasperas.

Storage Class Memory (SCM) combines the performance of DRAM with the non-volatility of disks, allowing it to replace both of them in the memory hierarchy to build a single-level system. This will have a significant impact on main-memory databases, as data structures will not need to be backed up to disk, but can be directly persistent in memory. However, keeping data consistent in such a system is not trivial, as there is little or no control over how the data is written from the volatile CPU caches and memory controller to non-volatile memory.

This master thesis focuses on persistent main-memory B-Trees, as B-Trees are a widely used component of main-memory databases. We designed a persistent and concurrent main-memory B-Tree that can recover from failures at any time into a consistent state without loss of data. Thanks to its hybrid SCM-DRAM design, it exhibits good performance despite the slower memory accesses of current Storage Class Memory technologies.

## November 26, 2015- 10:00am: "Optimizing HEVC CABAC Decoding with a Context Model Cache and Application-specific Prefetching". Philipp Habermann.

Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding module in the most recent JCT-VC video coding standard HEVC/H.265. As in the predecessor H.264/AVC, CABAC is a well-known throughput bottleneck due to its strong data dependencies. Among other optimizations, the replacement of the context model memory by a smaller cache has been proposed, resulting in an improved clock frequency. However, the effect of potential cache misses has not been properly evaluated. Our work fills this gap and performs an extensive evaluation of different cache configurations. Furthermore, it is demonstrated that application-specific context model prefetching can effectively reduce the miss rate and make it negligible. The best overall performance was achieved with caches of two and four lines, where each cache line consists of four context models. Four cache lines allow a speed-up of 10% to 12% for all video configurations, while two cache lines improve the throughput by 9% to 15% for high-bitrate videos and by 1% to 4% for low-bitrate videos.
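
To make the trade-off between two and four cache lines more tangible, the following is a minimal sketch of a context model cache simulator. It assumes a fully associative cache with LRU replacement and a made-up access trace; neither the replacement policy nor the trace is taken from the talk, and prefetching is not modeled.

```python
from collections import OrderedDict

MODELS_PER_LINE = 4  # each cache line holds four context models (as stated in the abstract)


def miss_rate(accesses, num_lines):
    """Miss rate of a fully associative LRU cache of context-model lines.

    The LRU policy and full associativity are assumptions for illustration.
    """
    cache = OrderedDict()  # line index -> None, ordered from least to most recently used
    misses = 0
    for ctx in accesses:
        line = ctx // MODELS_PER_LINE  # context models are grouped into lines of four
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = None
    return misses / len(accesses)


# Hypothetical trace of context model indices, not taken from a real bitstream.
trace = [0, 1, 2, 8, 9, 0, 3, 16, 17, 8, 1, 16]
for lines in (2, 4):
    print(f"{lines} cache lines: miss rate {miss_rate(trace, lines):.2f}")
```

With a real decoder, the trace would come from the sequence of context models selected during CABAC decoding, and a prefetcher would load lines before they are requested.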

## November 19, 2015- 10:30am: "Automatic Tuning of Stencil Computations with Structural Ordinal Regression Learning". Biagio Cosenza.

Stencil computation is an algorithmic pattern commonly found in many applications such as simulations, partial differential equations, Jacobi kernels, image processing, cellular automata and others.
Stencil codes are typically optimized with loop tiling, loop unrolling, as well as vectorization and multi-threading but, because of the complexity of the optimization space, automatic tuning is usually required to select the best performing optimization parameters.
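
As a small illustration of the pattern and of one tunable parameter, here is a sketch of a 5-point Jacobi sweep in plain Python, once with straightforward loops and once with loop tiling. The function names and the `tile` parameter are illustrative only; a real stencil compiler generates optimized native code, and the tile size is exactly the kind of parameter an autotuner would select.

```python
def jacobi_step(grid):
    """One 5-point Jacobi sweep over the interior of a 2D grid (list of lists)."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


def jacobi_step_tiled(grid, tile=4):
    """Same sweep, but iterating over tile x tile blocks (loop tiling)."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for ii in range(1, n - 1, tile):          # loop over tile rows
        for jj in range(1, m - 1, tile):      # loop over tile columns
            for i in range(ii, min(ii + tile, n - 1)):
                for j in range(jj, min(jj + tile, m - 1)):
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out


# Arbitrary test grid; both variants must produce the same result.
grid = [[float((3 * i + 5 * j) % 7) for j in range(10)] for i in range(9)]
print(jacobi_step_tiled(grid, tile=4) == jacobi_step(grid))
```

Tiling changes only the iteration order, not the result, which is why the tile size can be tuned freely for performance.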

State-of-the-art stencil autotuners are based on search and iterative compilation, which involves very long compilation times.
We propose a novel approach based on structural machine learning, which moves the code-dependent optimization process from compile time to a code-independent, per-target preprocessing step.

Our approach formulates the problem in terms of Ordinal Regression Support Vector Machines by exploiting ordinal relations within runs in the same training pattern; this allows us to build an accurate model that selects a good configuration with no additional compilation overhead.

We tested our model on a DSL stencil optimizer based on the Patus stencil compiler and compared it with the state of the art in iterative search. Preliminary results on stencil benchmarks show that the best configuration selected by our model is as fast as the one found after ~100 iterations of search heuristics.

## November 19, 2015- 10:00am: "Energy Consumption of GPU Execution Units". Jan Lucas.

Gate-level power models provide accurate predictions of power consumption but use very detailed and thus slow simulations. Architectural power simulators such as GPUSimPow or GPUWattch try to model the power consumption using abstract per-component models. Instead of using accurate input vectors for the individual components, their activity factors only count the number of operations performed. Using measurements on actual GPUs, we show that this is not accurate enough and that large changes in power consumption can occur depending on the input data. We develop a more accurate model for the power consumption of GPU execution units. This model can be employed to enhance the accuracy of GPU power simulators.

## November 12, 2015- 10:00am: "Evaluation of performance of eGPU using stereo vision algorithm". Shriraam Mohan.

Embedded GPUs are more and more commonly used for compute applications in embedded systems. A new version of the Nema GPU from Think Silicon, implemented on a Xilinx ZC706 evaluation board, is evaluated for performance using a stereo vision algorithm. The main function of the algorithm is to generate a 3D map from two images taken from two different angles. The individual images are preprocessed and stereo matched by calculating the smallest Hamming distance. The performance of the eGPU implementation of the algorithm is compared with a baseline version and other implementations, and any bottleneck identified as a result of the eGPU architecture is reported.
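
The matching step can be sketched as follows. The example assumes that preprocessing has already turned each pixel into a binary descriptor stored as an integer (for instance via a census-like transform, which is an assumption and not stated in the abstract); `best_disparity` and its parameters are hypothetical names.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded binary descriptors."""
    return bin(a ^ b).count("1")


def best_disparity(left_desc, right_row, x, max_disp):
    """Disparity d in [0, max_disp] whose right descriptor at x - d is closest.

    left_desc : descriptor of the left-image pixel at column x
    right_row : descriptors of the same row in the right image
    """
    candidates = range(0, min(max_disp, x) + 1)  # stay inside the image
    return min(candidates, key=lambda d: hamming(left_desc, right_row[x - d]))


# Made-up 4-bit descriptors for one image row of the right image.
right_row = [0b1010, 0b1111, 0b0001, 0b0110]
print(best_disparity(0b1111, right_row, x=3, max_disp=3))
```

Because every pixel is matched independently, this step maps naturally onto the many parallel threads of a GPU.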

## November 5, 2015- 10:00am: "Profile Based Huffman Encoding with Multiple Trees for GPUs". Sohan Lal.

Modern Graphics Processing Units (GPUs) continue to scale the number of cores in accordance with Moore's law. Unfortunately, off-chip memory bandwidth is growing slowly compared to the desired growth in the number of cores. In addition, GPUs demand high bandwidth because the GPU architecture is oriented towards throughput. We are in a situation where off-chip bandwidth has become a performance and throughput bottleneck. Thus, approaches to optimize bandwidth will play a significant role in scaling the performance of GPUs.

Data compression is a promising approach for increasing the effective capacity and bandwidth, which can translate into higher performance and energy efficiency. In this talk, we present a profile-based Huffman compression technique for GPUs which uses multiple trees for compression. Profiling is used to estimate the probability of input symbols. In the preliminary results, we show that the geometric mean of the compression ratio of profile-based Huffman compression is about 40% higher than that of the state-of-the-art compression technique. Furthermore, we show that the compression ratio achieved by our proposed compression technique is close to the optimal compression ratio given by information theory.
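
The relation between a profiled symbol distribution, the resulting Huffman code and the information-theoretic bound can be sketched as below. This is a single-tree textbook construction with an invented profile; the multi-tree scheme from the talk is not reproduced here.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code built from probabilities."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # merging two subtrees adds one bit to every symbol below
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


# Hypothetical profile: estimated probabilities of four input symbols.
profile = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(profile)
avg_bits = sum(profile[s] * lengths[s] for s in profile)
entropy = -sum(p * math.log2(p) for p in profile.values())
print(lengths, avg_bits, entropy)
```

For this dyadic profile the average code length equals the entropy; for general distributions Huffman coding stays within one bit per symbol of that bound.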

## October 29, 2015- 10:00am: "Preliminary investigation of wireless communication between medical devices to increase the device usability within minimally invasive procedures". Martin Schürer.

Minimally invasive surgery (MIS), also known as keyhole surgery, has changed the typical way of performing operations. Influencing the techniques used in many specialties of surgical medicine, minimally invasive procedures have become an alternative to conventional ones.

A laparoscopy, for example, requires the use of an insufflation device that inflates the abdominal cavity with carbon dioxide. The insufflator is located on a tower together with other devices that are required for MIS procedures, such as camera, lighting, pumps and monitors. The tower stands outside the sterile environment. To change insufflation settings, the surgeon commonly instructs a non-sterile surgery assistant to apply the changes. This command chain causes a time delay, which is unsuitable in hectic situations. The assistant may also block the surgeon's view of the monitor.

A technical concept is being developed to investigate the impact of wireless communication on medical devices within MIS. In the near future, this thesis can help in the establishment of integrated operating rooms, or at least provide relief for the assistant.

The main objective of this thesis is to develop a prototype that is able to control an insufflation device. As an optional target, the prototype should support the simultaneous control of multiple medical devices. To this end, the insufflator's human interface module (HIM) will be extended by a Bluetooth Low Energy (BLE) module and an insufflation GATT service will be implemented. Based on the classical client-server model, the insufflator represents the server and a tablet device the client. Following the device pairing, the client runs an application that communicates with the server's GATT services and thus enables wireless communication between tablet and insufflator.

## October 22, 2015- 10:00am: "Web-Based HEVC/H.265 Decoding". Thomas Siedel.

HEVC provides better compression efficiency than H.264 but is still not used very often, especially on the web. The main problem is the missing integration of HEVC decoders in browsers and common plugins such as Adobe Flash. While custom plugins can provide the required functionality, they have disadvantages and are hard to distribute. This thesis describes a new approach where the decoding is done entirely with JavaScript directly in the browser. For this purpose, a modified version of the TUB AES HEVC decoder written in C++ is ported to JavaScript with the help of the Emscripten compiler. The performance of the resulting JavaScript program is evaluated and compared to the original HEVC decoder. Performance optimizations using an experimental SIMD library for JavaScript (SIMD.js) are also implemented and evaluated.

## October 15, 2015- 10:00am: "Reduction of Radar Processing Latency". Tamilselvan Shanmugam.

Technology makes everything simpler, faster and smarter. In an attempt to replace the Radar processor, an analysis is carried out based on a mock-up of a Radar processing algorithm. The well-known ARM platform is chosen with power consumption and portability in mind. Several ARM Cortex-A9 processors are combined to compete with the capabilities of the existing super-fast Radar processor. The analysis investigates various means of scheduling the Radar application on the ARM processors. Because the software is safety-critical, worst-case execution time is a crucial factor. However, the analysis concluded that the worst-case execution time on the ARM processors is not in the acceptable range.

This thesis studies the bottlenecks imposed by the application, the data dependencies between the processing chains, and possible improvements. The implementation focuses on the parallelism in the Radar application and schedules the parallel tasks in an optimal way to achieve the best results. As part of the investigation, peak processor utilization, peak memory utilization, peak bandwidth utilization, worst-case execution time, bottlenecks and future scope are discussed in detail. An analysis is presented based on the new implementation schemes.

## October 8, 2015- 10:00am: "Open publications and research data: improving the scientific method in computer science, and making funding agencies happy". Mauricio Álvarez Mesa.

It is becoming common for the results of research projects funded by public institutions to be required to be published under an open access (OA) model. OA means that papers, and also the corresponding research data, should be freely accessible to the public. Open access can benefit the recipients of research information (students, researchers, general public), and it can also help to improve the reproducibility of scientific results. In this talk, we will review the basic principles behind open access, and also analyse how to implement a practical plan for open access to publications and research data in the field of computing systems.

## October 1st, 2015- 10:00am: "Overview of the GPU-Assisted HEVC Decoder". Diego Felix de Souza.

The added encoding efficiency and visual quality offered by the High Efficiency Video Coding (HEVC) standard are attained at the cost of a significant computational complexity of both the encoder and the decoder. In particular, the considerable number of prediction modes and partitions that are now considered by this standard, together with the increased complexity of the adopted block coding tree structures, impose computational demands that can hardly be satisfied by current general-purpose processors under hard real-time requirements. Furthermore, the strict data dependencies that are imposed make parallelization a difficult and hardly efficient option with conventional approaches. To circumvent this adversity, this work exploits Graphics Processing Units (GPUs) to accelerate the decompression procedure in HEVC, encompassing the most demanding modules of the decoder (i.e., de-quantization, inverse transform, motion compensation, intra prediction, de-blocking filter and sample adaptive offset). The presented approaches comprehensively exploit both coarse and fine-grained parallelization opportunities in an integrated perspective by re-designing the execution pattern of the involved modules, while simultaneously coping with their inherent computational complexity and strict data dependencies. As a result, the proposed parallelization, which is fully compliant with the HEVC standard, has proven to be a remarkably viable approach, being capable of satisfying hard real-time requirements by processing each Ultra HD 4K frame in close to 20 ms (about 50 fps).

## September 24, 2015- 10:00am: "Data Transformation Trajectories in Embedded Systems". Nawabul Haque.

As part of the thesis, the task is to investigate the feasibility of trajectory formation for UEs in an LTE network from the network equipment side (Radio Base Station, RBS, equipment). The task includes identifying the data and methods required for UE position estimation and building a prototype which runs continuously on one of the network equipment units installed in the RBS and calculates the positions of all UEs present in that RBS in real time. It also involves the implementation of filtering algorithms (Kalman filter, particle filter) to filter the UE position data in order to form a real-time trajectory for the UEs, and finally posting this trajectory to a central server at regular intervals. This trajectory data can then be fetched from the server and projected on a map or trajectory projection system.
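
How a Kalman filter smooths noisy position estimates can be sketched with a scalar filter. This is a minimal example under simplifying assumptions: a random-walk motion model, one coordinate at a time, and invented noise variances and measurements. It is not the filter design used in the thesis.

```python
def kalman_1d(measurements, process_var, meas_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a random-walk state model.

    measurements : noisy position samples for one coordinate
    process_var  : assumed variance of the position change between samples
    meas_var     : assumed variance of the measurement noise
    """
    x, p = x0, p0  # state estimate and its variance
    estimates = []
    for z in measurements:
        p = p + process_var        # predict: uncertainty grows between samples
        k = p / (p + meas_var)     # Kalman gain: how much to trust the measurement
        x = x + k * (z - x)        # update the estimate with the innovation
        p = (1.0 - k) * p          # updated (reduced) uncertainty
        estimates.append(x)
    return estimates


# Hypothetical noisy x-coordinates (in meters) of a slowly moving UE.
noisy_x = [10.4, 9.7, 10.9, 10.1, 11.2, 10.8, 11.5, 11.1]
print([round(v, 2) for v in kalman_1d(noisy_x, process_var=0.05, meas_var=1.0, x0=10.0)])
```

Applying the same filter independently to each coordinate gives a smoothed trajectory; a constant-velocity model or a particle filter would be the next refinement.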

## July 30, 2015- 10:00am: "Transmitter Algorithm on Intel Architecture". Mostafa Shahin.

As Moore’s law approaches its expiration date because of the power wall, research and development are moving towards alternative solutions to fulfill the need for better performance. In this thesis, the objective is to achieve real-time capable implementations of LTE algorithms on Intel Architecture. The presented solution exploits the SIMD units of general-purpose Intel processors to accelerate the highly data-parallel algorithms.

The modulation and demodulation functions of the LTE algorithms have been accelerated by vectorizing the legacy code to vector widths of 128 bits and 256 bits on the Sandy Bridge and Haswell microarchitectures, respectively. The acceleration achieved through vectorization is over 10x, which fulfills the real-time requirements of the algorithm. In order to achieve even higher accelerations, the algorithm was multi-threaded to benefit from the multi-core processor architecture.

## July 23, 2015- 10:00am: "High performance CCSDS image data compression using GPGPUs for space applications". Sunil Chokkanathapuram Ramanarayanan.

The use of Graphics Processing Units (GPUs) as computing architectures for inherently data-parallel signal processing applications is very popular in the current computing era. In principle, GPUs can achieve a significant speed-up over Central Processing Units (CPUs), especially for data-parallel applications that require high throughput. The thesis investigates the use of GPUs for running space-borne image data compression algorithms, in particular the CCSDS 122.0-B-1 standard as a case study. The thesis proposes a design that parallelizes the Bit-Plane Encoder (BPE) stage of CCSDS 122.0-B-1 in lossless mode using GPUs to achieve high throughput and to facilitate real-time compression of satellite image data streams.

Experimental results are provided by comparing the compression time of the GPU implementation with a state-of-the-art single-threaded CPU implementation and a Field-Programmable Gate Array (FPGA) implementation. The GPU implementation on an NVIDIA® GeForce® GTX 670 achieves a peak throughput of 162.382 Mbyte/s (932.288 Mbit/s) and an average speed-up of at least 15 times over the software implementation running on a 3.47 GHz single-core Intel® Xeon™ processor. The high-throughput CUDA implementation using GPUs could potentially be suitable for airborne and space-borne applications in the future, if GPU technology evolves to become radiation-tolerant and space-qualified.

## 16.07.2015- 11:00: "IDCT Hardwareimplementierung für einen H.265/HEVC Dekoder" (IDCT Hardware Implementation for an H.265/HEVC Decoder). Boris Graf.

The Embedded Systems Architecture (AES) group at TU Berlin is working towards an effective hardware implementation of an HEVC/H.265 decoder. The target platform is a Xilinx Zynq-7045 SoC. As part of this research, this work developed an implementation of the inverse transform. The goal was to achieve an acceptable throughput with the lowest possible resource usage. Based on the partial butterfly algorithm, the area was reduced by reusing hardware components. Adding a pipeline stage made a higher clock frequency possible. In addition, knowledge about zero coefficients was used to reduce the computation time. The implementation of the inverse transform uses 6% of the resources of the target platform. At the maximum clock frequency of 155 MHz, it can process 4K (3840x2160) video at 50 fps.
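
The idea behind the partial butterfly can be sketched for the smallest case, the 4-point inverse transform, using the integer 4x4 core transform matrix of HEVC. The sketch is in software and omits the rounding, shifting and clipping stages of a complete decoder; it only shows how splitting the computation into an even and an odd part reduces the number of multiplications compared with a full matrix product. The function names are illustrative and do not come from the thesis.

```python
# HEVC 4x4 core transform matrix (rows are basis functions).
M4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def inverse_dct4_matrix(c):
    """Reference: full matrix product, 16 multiplications."""
    return [sum(M4[k][n] * c[k] for k in range(4)) for n in range(4)]


def inverse_dct4_butterfly(c):
    """Partial butterfly: even/odd decomposition, 6 multiplications."""
    o0 = 83 * c[1] + 36 * c[3]   # odd part, from odd-indexed coefficients
    o1 = 36 * c[1] - 83 * c[3]
    e0 = 64 * (c[0] + c[2])      # even part, from even-indexed coefficients
    e1 = 64 * (c[0] - c[2])
    # Symmetric outputs share the same even/odd terms (the "butterfly").
    return [e0 + o0, e1 + o1, e1 - o1, e0 - o0]


coeffs = [10, -3, 0, 2]  # arbitrary example coefficients
print(inverse_dct4_butterfly(coeffs), inverse_dct4_matrix(coeffs))
```

If the odd-indexed coefficients are zero, the odd part vanishes entirely, which illustrates how knowledge about zero coefficients can save work.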

## 16.07.2015- 10:30: "SAO und Deblocking Filter Hardwareimplementierung für einen H.265/HEVC-Dekoder" (SAO and Deblocking Filter Hardware Implementation for an H.265/HEVC Decoder). Sven Erik Jeroschewski.

The High Efficiency Video Coding (HEVC) standard, published in 2013, aims to increase the compression efficiency of video sequences. Among other things, it defines two in-loop filters that the decoder applies to all reconstructed pictures of a video. The first is a deblocking filter, which is intended to remove unwanted edges that can arise from the block-wise coding of video sequences prescribed by the HEVC standard. The second is a sample adaptive offset (SAO) filter, which is intended to remove ringing and blurring artifacts. This work presents how the two filters can be implemented efficiently together in a single hardware component. It exploits the fact that a picture contains regions that can be filtered almost independently of each other. The presented solution was implemented in the hardware description language VHDL. Synthesis for a Xilinx Zynq 7045 FPGA showed that the implemented component has low resource requirements and can filter video sequences with 60 frames per second and a resolution of 3840x2160 pixels in real time.

## 16.07.2015- 10:00: "Hardwareimplementierung der Intra-Prediction für einen HEVC-Dekoder" (Hardware Implementation of Intra Prediction for an HEVC Decoder). Jonas Eike Tröger.

As part of my bachelor thesis, I implemented a component that performs intra prediction according to the HEVC standard in the hardware description language VHDL. The implementation is intended for use in a hardware decoder that is to be realized in an FPGA. The implemented component can perform intra prediction for 1920x1080 video at a frame rate of 60 fps when intra prediction is used exclusively. For the implementation, the algorithms were analyzed and optimized in several places to enable an effective hardware implementation. Furthermore, various factors were identified that influence the performance of this implementation and the achievable performance of other implementations. Both factors that are part of the algorithm and factors that are part of the general decoding flow were analyzed. Finally, the component was simulated with different video material and the results were correlated with statistics about the material.

## July 9th, 2015- 10:00am: "Design and Implementation of an FPGA-based Optical QAM Receiver for Real-Time Applications". Björn Böttcher.

Digital signal processing (DSP) has recently gained high importance in modern optical communication systems. DSP development is therefore key to the design and development of optical transmission systems. To test new algorithms in real time in laboratory experiments, a flexible infrastructure based on FPGAs is highly desirable. This presentation gives an overview of the basics of a digital coherent optical transmission system and discusses options for the realization of the FPGA platform. It also describes the current state of the implementation phase of this master thesis and closes with an outlook on its completion.

## July 2nd, 2015- 10:00am: "Exploration of Interoperability between OpenCL and OpenGL". Michael Kuerbis.

Some applications, such as video playback engines, have a chance to utilize both the parallel computation API OpenCL and the graphics API OpenGL. In such a case, the interoperability between OpenCL and OpenGL is useful as it enables OpenGL to render a frame decoded with OpenCL on the GPU directly. In this study, the interoperability between OpenCL and OpenGL is explored across different platforms, with a focus on the exchange of data buffers between the APIs. A library to cover this interoperability is presented, together with a first glance at the research results.

## June 25th, 2015- 10:00am: "Evaluating Nexus++ as a Tightly Coupled Coprocessor for ARM Multicore Processors". Haopeng Han.

The task-based execution model OmpSs is a promising solution for multicore programmability.
According to previous research done at AES@TU-Berlin, the runtime system is a bottleneck that limits the performance of this programming model. A hardware accelerator, Nexus++, was therefore developed to boost the performance. The aims of this Master thesis are to prototype Nexus++ on a state-of-the-art FPGA SoC, the Xilinx Zynq, and to evaluate it with a set of benchmarks.

## June 4th, 2015- 10:00am: "A scheme to support two dimensional memory access in image and video processing applications". Tareq Alawneh.

The processor speed has increased much faster than the memory speed. Therefore, the expected performance improvements from increasing the processor speed are limited by the memory latency. Most of the algorithms, which work on multimedia data such as images and videos, exhibit a distinguished data access pattern. This data access pattern is called the two dimensional access. If this peculiar data access is exploited properly, it could yield significant application performance improvement.

Unfortunately, standard cache and memory organizations do not mirror the way data are accessed by image and video processing applications. As a result, they achieve poor performance when used for multimedia. In this work, we propose a memory organization to exploit the 2D data access pattern.
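To make the mismatch concrete, the following minimal sketch counts how many distinct cache lines a square block read touches under two layouts. The 64-byte line size, one byte per pixel, the 1920-pixel image width, and the tiled layout itself are illustrative assumptions; the function names are hypothetical and this is not the memory organization proposed in the talk.

```python
def lines_row_major(x0, y0, block, width, line_bytes=64):
    """Count distinct cache lines touched by a block x block read
    when pixels (1 byte each, an assumption) are stored row by row."""
    lines = set()
    for y in range(y0, y0 + block):
        for x in range(x0, x0 + block):
            lines.add((y * width + x) // line_bytes)
    return len(lines)


def lines_tiled(x0, y0, block, width, tile=8):
    """Same count when every tile x tile pixel square is stored contiguously
    (hypothetical layout; one 8x8 tile fills exactly one 64-byte line)."""
    tiles_per_row = width // tile
    line_bytes = tile * tile
    lines = set()
    for y in range(y0, y0 + block):
        for x in range(x0, x0 + block):
            tile_index = (y // tile) * tiles_per_row + (x // tile)
            offset = (y % tile) * tile + (x % tile)
            lines.add((tile_index * line_bytes + offset) // line_bytes)
    return len(lines)


# An aligned 8x8 block in a 1920-pixel-wide image.
print(lines_row_major(0, 0, 8, 1920))  # 8: one cache line per image row
print(lines_tiled(0, 0, 8, 1920))      # 1: the whole block sits in one line
```

Under these assumptions the row-major layout fetches eight 64-byte lines to deliver 64 useful bytes, which is the kind of inefficiency a 2D-aware memory organization targets.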

## May 21st, 2015-10:30am: "A hybrid HEVC CPU-GPU Decoder". Biao Wang.

In this talk, a design of an HEVC decoder with GPU acceleration using OpenCL will be presented.

HEVC is the latest video codec with a promise of 2x compression efficiency compared to the preceding H.264/AVC video codec. However, the cost of higher compression efficiency is increased computational complexity.
With the help of parallelism techniques such as multi-threading and SIMD (Single Instruction, Multiple Data) instruction extensions, general purpose processors might be capable of achieving real-time decoding for 1080p (1920x1080) and 4K (3840x2160) videos. However, it is difficult for them to deal with the next generation 8K (7680x4320) videos because of the larger input data set.

With the emergence of GPGPU (General Purpose GPU) computation, the above problem inspires us to employ the GPU to accelerate the decoding process. In this talk, the parallelism of different kernels within HEVC decoding is analyzed and a kernel partition scheme between CPU and GPU is proposed based on this analysis. In addition, a performance model based on memory bandwidth is proposed to predict the performance of this design.

## May 21st, 2015-10:00am: "Research Overview & Proposal". Biagio Cosenza.

In this talk, I will give a quick overview of my research work.

I will start by briefly showing my interdisciplinary work, mostly produced within the DK-Plus multidisciplinary research program at the University of Innsbruck and including collaborations with astrophysicists, chemists, engineers and computer graphics experts.

The second contribution is the Insieme Compiler framework, in particular the OpenCL infrastructure. I will illustrate how we solve complex problems such as automatic partitioning for heterogeneous architecture, data layout optimizations, and automatic parallelization from sequential codes.

Next, I will introduce our previous work on programming models and abstraction for HPC; our contribution is libWater, a runtime system that extends OpenCL and supports heterogeneous distributed cluster architectures. It integrates with the Insieme Compiler analysis and code generation and has been tested on compute clusters available at BSC (Spain), VSC (Vienna) and CEA (France).

I will conclude my presentation with a plan illustrating ongoing collaborations and research topics.

## April 30th, 2015-10:00am: "A Proximity scheme for instruction caches in Tiled CMP Architectures". Tareq Alawneh.

Recent research results show that there is a high degree of code sharing between cores in CMP architectures. In this study we propose a proximity scheme for the instruction caches, a scheme in which the shared code blocks among the neighboring L2 caches in tiled CMP architectures are exploited to reduce the average cache miss penalty and the on-chip network traffic.

## April 23rd, 2015-10:00am: "Efficient Real-Time FPGA Implementation of Kaze Features". Lester Kalms.

KAZE describes 2D features in a non-linear scale space by means of non-linear diffusion filtering. Diffusion is the physical process of balancing concentration differences without creating or destroying mass. Non-linear diffusion makes this diffusion process adaptive to the local image structure. This reduces noise but keeps image boundaries in higher scale levels, unlike in the Gaussian scale space (linear diffusion), where all structures in an image are smoothed evenly. This work rewrites the software implementation of KAZE features for an ARM processor and optimizes it for speed, memory usage and portability. Furthermore, the algorithm has been accelerated by offloading a large part of the software into a hardware design, which is realized on an FPGA. Besides the good speedup, this design has several further advantages, such as scalability and predictability.

## April 9th, 2015-10:00am: "Performance Evaluation of GPU Video Rendering". Haifeng Gao.

The recently introduced HEVC/H.265 codec is projected to allow high-quality, high-resolution videos to be compressed at reasonable bitrates. While the complexity of HEVC decoding has been solved, high-performance rendering for UHD video playback is still missing.

In this thesis, a solution for rendering videos on the GPU using OpenGL is developed and the performance of each rendering stage is evaluated. It allows performance bottlenecks on the hardware platforms and in the OpenGL software stack to be investigated and detected.

According to the performance evaluation, we found that texture uploading is the performance-critical part of video rendering, even though it has been highly optimized. On discrete GPUs, the performance of video rendering is mostly limited by the data transfer bandwidth of PCI Express, while on integrated GPUs, the memory bandwidth is the limitation.

## March 19th, 2015-10:00am: "Efficient Real-Time FPGA Implementation of Kaze Features". Lester Kalms.

KAZE features is a multi-scale 2D feature detection and description algorithm. It describes 2D features in a non-linear scale space by means of non-linear diffusion filtering. By making the diffusion process adaptive to the local image structure, it reduces noise but keeps image boundaries in higher scale levels. This work rewrites the software implementation of KAZE for an ARM processor and optimizes it for speed and memory usage. A hardware accelerator has been implemented on an FPGA to speed up the most time-consuming block in the algorithm.

## February 26th, 2015-10:00am: "Using OpenCL for FPGAs: Is RTL going to die?". Matthias Göbel.

FPGA designs are traditionally written in hardware description languages like VHDL or Verilog. However, with the increasing complexity of FPGA projects, the disadvantages of this approach, like poor maintainability and long time-to-market, are increasingly evident.

To tackle this problem, leading FPGA manufacturers like Xilinx and Altera promote OpenCL as a framework for FPGA designs. In this talk, the idea of using OpenCL for FPGA designs will be discussed and the corresponding products of both companies will be presented in order to see whether OpenCL is the way to go in FPGA designs.

## February 12th, 2015-10:00am: "Hardware Accelerated Framework for Computer Vision". Kawa Haji Mahmoud.

Computer vision systems are becoming an inevitable requirement in a wide range of applications. Implementing computer vision systems on embedded platforms is gaining interest due to the widespread use of embedded devices such as smartphones and tablets. Real-time performance and power consumption are key elements in designing various embedded systems. The complexity of computer vision algorithms has been the bottleneck in meeting these design requirements. Several approaches have been introduced to accelerate computer vision algorithms, for instance custom vision algorithm designs on FPGAs and optimised GPU accelerators, among many others.

In this presentation, we talk about heterogeneous hardware accelerated platforms that can leverage the advantages of different hardware platforms, resulting in an embedded vision framework that meets different application demands. As an example, we present a study of the numerous attempts to accelerate the Harris Corner Detection algorithm as well as our own attempt towards a high-performance and low-cost implementation of this algorithm.

## January 29th, 2015-10:30am: "FPGA Implementation of License Plate Detection and Recognition". Farid Rosli.

Rapid development of the Field Programmable Gate Array (FPGA) offers an alternative way to accelerate computationally intensive tasks such as digital signal and image processing. Its ability to perform parallel processing shows potential for implementing a high-speed vision system. Out of numerous applications of computer vision, this thesis focuses on the hardware implementation of one that is commercially known as Automatic Number Plate Recognition (ANPR). It is a modern CCTV-based surveillance method that can locate and recognize vehicle registration numbers. Morphological operations and Optical Character Recognition (OCR) algorithms have been implemented on a Xilinx Zynq-7000 All-Programmable SoC to realize the functions of an ANPR system. Test results have shown that the designed and implemented processing pipeline, which consumes 63% of the logic resources, is capable of delivering results with a relatively low error rate. Most importantly, the computation time satisfies the real-time requirement for many ANPR applications.

## January 29th, 2015-10:00am: "Memory Divergence: Data over-fetch and locality analysis". Sohan Lal.

GPUs have been very successful in accelerating applications with regular computation and access patterns.
However, they do not work well with irregular applications. One of the reasons for low performance of these applications is that they suffer from memory divergence.
An access to the memory is divergent if intra-warp memory accesses for a load or store instruction cannot be coalesced into a single memory transaction.
Memory divergence is known to cause many problems, including data over-fetch which can waste scarce resources such as cache capacity and memory bandwidth.
In this work, first we quantify the amount of data over-fetch caused by memory divergence and then we show that memory divergent kernels have unexploited data locality which if exploited will reduce data over-fetch.
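The definition of divergence and over-fetch above can be sketched in a few lines. The sketch assumes a 32-thread warp, 4-byte aligned accesses and 128-byte memory transactions, and it defines over-fetch as fetched bytes divided by useful bytes; these parameters, the metric and the function name are illustrative assumptions rather than the methodology of the work.

```python
def warp_transactions(addresses, access_bytes=4, segment_bytes=128):
    """Return (transactions, over_fetch_ratio) for one warp-wide load.

    addresses: byte address accessed by each active thread of the warp.
    Accesses are assumed aligned, so none straddles two segments.
    """
    # Every distinct segment touched costs one memory transaction.
    segments = {addr // segment_bytes for addr in addresses}
    fetched = len(segments) * segment_bytes
    # Threads reading the same word share it, so count distinct addresses.
    useful = len(set(addresses)) * access_bytes
    return len(segments), fetched / useful


coalesced = [4 * lane for lane in range(32)]    # consecutive words
divergent = [128 * lane for lane in range(32)]  # one word per segment

print(warp_transactions(coalesced))  # (1, 1.0): single transaction, no over-fetch
print(warp_transactions(divergent))  # (32, 32.0): 32 transactions, 32x over-fetch
```

In the divergent case each thread uses only 4 of the 128 bytes fetched on its behalf, which is the wasted cache capacity and bandwidth the abstract refers to.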

## January 22nd, 2015-10:00 to 10:30am: "Two-Dimensional Memory implementation supporting the image processing applications". Tareq Alawneh.

Memory performance is an important factor of the overall system performance.
Memory performance depends heavily on how access patterns are handled by modern memory devices.
For image processing applications, the way conventional memory devices resolve memory accesses results in poor cache performance, owing to the non-sequential access patterns (e.g. two-dimensional or block access patterns) and the poor temporal locality of these applications.

This work introduces a memory implementation that improves memory and cache performance by detecting two-dimensional accesses in the memory system for image processing applications and supplying the cache with the data the processor will request soon.
The scheme aims to reduce memory latency, increase throughput, reduce power consumption, and increase bank utilization.

## January 8th, 2015-10:00 to 10:30am: "Scalarization and Temporal SIMT in GPUs-reducing redundant operations for better performance and higher energy efficiency". Jan Lucas.

Scalarization is a technique to remove certain redundant operations from GPU kernels. In this presentation, we will analyze why these redundant operations are common in GPU kernels and how they can be recognized by the compiler. We will then explain how Scalarization can be integrated into different GPU architectures. In this part of the presentation we will show the benefits of combining Scalarization with Temporal SIMT based GPU architectures. Finally, we will show how Scalarization results in significant improvements in performance and energy efficiency in many common GPU applications.

## December 11th, 2014-10:30 to 11:00am: "Revisiting SIMD Vectorization". Angela Pohl.

The vast majority of current processors are equipped with Single Instruction, Multiple Data (SIMD) instruction set extensions. They promise significant performance gains for applications where the same operation is applied to multiple independent data chunks.
The technology was introduced to microprocessors in 1994; however, one of its major downsides has been the significant effort required to utilize SIMD instructions. Although solutions such as auto-vectorization have been introduced to modern compilers, real-world applications still struggle to achieve maximum speed-up without manual code optimizations.
This presentation will discuss the current inefficiencies when using SIMD and propose a solution that significantly decreases the effort of manual code tuning with only marginal performance impact.

## December 11th, 2014-10:00 to 10:30am: "Look-Ahead Analysis for the HEVC Video Encoder". Gabriel Cebrian Marquez.

The High Efficiency Video Coding (HEVC) standard achieves a bit rate reduction of up to 50% at the same subjective quality compared to its predecessor H.264/AVC. However, this increase in coding efficiency has a deep impact on the computational complexity of the encoder. In order to overcome this complexity, a huge effort is being made by the scientific community to develop algorithms and techniques that reduce it.

In this scenario, a look-ahead algorithm is a well-known pre-analysis step performed on the encoder to extract some preliminary information. This information can be used to skip some operations in the encoding process and, hence, to obtain substantial speed-ups at the cost of negligible coding efficiency penalties. Nevertheless, there is almost no research available in the state of the art regarding this pre-analysis algorithm. This presentation aims to show the main steps carried out towards the development of an efficient look-ahead algorithm, as well as the conclusions that have been drawn from them. Some initial results will also be presented.

## November 27th, 2014-10:30 to 11:00am: "Ultra-Low Power Parallel Architectures". Daniele Bortolotti.

Since I recently joined the AES research group, in this talk I will present an overview of the research activity carried out in the EEES Lab at the University of Bologna. The talk is divided into three parts. First, I will introduce the reference multi-core architecture, a scalable template suitable for the many-core era, then I will introduce the Virtual Platform that I jointly developed and used as a research tool. Finally, I will present a summary of the most relevant research aspects when the architecture is operated in near-threshold regime, such as HW/SW mechanisms to tackle variability, hybrid memory to guarantee error-free operation and ULP biomedical processing.

## November 27th, 2014-10:00 to 10:30am: "Rate Control Optimization in the AES HEVC Encoder". Sergio Sanz-Rodriguez.

The quality of service of a video coding application is mainly characterized by the target bit rate, so the output bit rate of a video encoder is the key parameter to be controlled in order to meet the target bit rate. For this purpose, a rate control (RC) algorithm is embedded in the video encoder. This presentation is organized in four parts: in the first part the AES HEVC encoder is briefly introduced; in the second part the RC problem in video compression is described; in the third part the basic ideas of the first RC version are presented, with special emphasis on its weaknesses; finally, the fourth part introduces some ideas to improve the performance of the algorithm. These optimizations will be implemented in a second RC version.

## November 13th, 2014-10:30 to 11:00am: "Parallelization of HEVC Deblocking Filter on GPUs". Biao Wang.

The High Efficiency Video Coding (HEVC) standard is the state of the art in video coding technology. Compared to its predecessor H.264/AVC, it achieves equivalent subjective quality with approximately 50% less bit rate. However, the price to be paid for the improved coding efficiency is the increased computational complexity of the encoder and decoder. The deblocking filter reduces visual artifacts at block boundaries in the reconstructed picture. It takes 14% of the decoding time according to decoder profiling experiments.

This presentation shows the strategy for parallelizing the deblocking filter on Graphics Processing Units (GPUs) using the OpenCL programming model. Compared to H.264/AVC, the HEVC deblocking filter exposes more parallelism, as each 8x8 grid of samples can be processed independently. Therefore, each thread is assigned to one 8x8 grid of samples, and the grids can be deblocked in parallel. An improved kernel for the chroma component, in which the two chroma planes are interleaved, is also presented.
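The one-thread-per-grid mapping can be illustrated with a short index calculation. This is only a sketch of the work decomposition, assuming picture dimensions that are multiples of 8 and a linear thread index; the helper names are hypothetical, and the actual OpenCL kernel from the talk is not reproduced here.

```python
def num_threads(width, height, grid=8):
    """Number of independent grid x grid blocks, i.e. threads to launch."""
    return (width // grid) * (height // grid)


def grid_origin(thread_id, width, grid=8):
    """Top-left sample (x, y) of the grid handled by a given thread."""
    grids_per_row = width // grid
    gx = thread_id % grids_per_row
    gy = thread_id // grids_per_row
    return gx * grid, gy * grid


# A 1920x1080 luma plane decomposes into 240 x 135 independent grids.
print(num_threads(1920, 1080))  # 32400
print(grid_origin(241, 1920))   # (8, 8): second grid in the second grid row
```

Because no grid depends on the filtered output of another, all of these threads can run without synchronization.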

## November 13th, 2014-10:00 to 10:30am: "Data Processing on Heterogeneous Hardware". Max Heimel.

Modern processors are no longer primarily bound by transistor density and frequency but by their power and heat budgets. This so-called “power wall” forces hardware vendors to rely to a greater extent on designing specialized devices that are optimized for certain types of computations, resulting in an increasingly heterogeneous hardware landscape. Accordingly, in order to keep up with the performance requirements of the modern information society, tomorrow's database systems will need to exploit and embrace this increased heterogeneity.
In this talk, I will cover our work in the area of database engine design for a heterogeneous hardware landscape. In particular, I will talk about Ocelot, a hardware-oblivious database engine that aims at reducing the development overhead that is required to support new hardware architectures. Ocelot utilizes hardware abstraction and adaptive learning techniques to seamlessly run the same operator code across various different devices, and to learn and identify the strengths and weaknesses of a given hardware architecture at runtime. By following this paradigm, Ocelot is able to efficiently utilize the compute power of multi-core CPUs, graphics cards, and accelerator cards like the Xeon Phi, without having been manually programmed or tuned for any of these architectures.

## October 16th, 2014-10:30 to 11:00am: "HEVC CABAC Hardware Decoding". Phillipp Habermann.

Context-based adaptive binary arithmetic coding (CABAC) has been a throughput bottleneck in previous video coding standards, and it still is in HEVC. Due to its sequential nature, caused by strong bin-to-bin dependencies, no data-level parallelism can be exploited. Thread-level parallelism is an option when high-level parallelization tools, like tiles and wavefront parallel processing, are used. Unfortunately, they are not mandatory in HEVC. Previous work has shown that a hardware accelerator is an appropriate optimization technique for this crucial part of a video decoder.

The presentation shows the current status of the HEVC hardware decoding project. A full CABAC hardware decoder has been implemented that provides a significant speed-up compared to software solutions. Current performance results are shown, as well as the next development steps of the project.

## October 16th, 2014-10:00 to 10:30am: "An evaluation of memory interconnect networks in All-programmable SoCs". Matthias Göbel.

Recent years have seen the introduction of so-called "All-programmable SoCs" by various vendors. These devices combine powerful multi-core processors with programmable logic like that found in FPGAs. They offer a great platform for HW/SW-codesign and allow higher performance than traditional methods.

Nonetheless, HW/SW-codesign approaches still suffer from the communication bottleneck between processors and HW solutions. In most cases the communication is performed by using shared memory. Therefore the memory access by both the processor and the hardware parts of such a codesign is a limiting and therefore crucial aspect of any HW/SW-codesign platform.

This presentation deals with an analysis of the various memory and communication interconnects found in an actual device, namely the Zynq-7000 by Xilinx. Issues like different access patterns, cache coherency and full-duplex communication are covered, mainly with a focus on applications from the field of video coding. Furthermore, the presentation shows that by carefully choosing the memory interconnect networks as well as the software interface, a tremendous speedup can be achieved that is limited only by the memory chips used.

## October 2nd, 2014-10:30 to 11:00am: "Performance Evaluation of GPU Video Rendering". Haifeng Gao.

In general, a video player processes audio and video streams from a media file. Video processing is one of the most important parts of a video player. The HEVC group focuses on video processing and has developed an optimized video decoder. In order to display the decoded video stream on the screen, a GPU video renderer for raw video is still needed. The aim of this work is to implement a simple but optimized video renderer based on OpenGL and finally integrate it with the HEVC codec. For this purpose, we evaluate the performance of OpenGL rendering on different platforms in order to find the bottleneck in the OpenGL rendering pipeline. This presentation will introduce the general idea of video rendering and present the research results. We find that the texture transfer from CPU memory to VRAM is the biggest bottleneck in the rendering pipeline, and that for Intel integrated GPUs, the OpenGL driver performs much better on Windows than on Linux.

## October 2nd, 2014-10:00 to 10:30am: "Hardware Implementation of a Face Detection System on FPGA". Hichem Ben Fekih.

Robust and rapid face detection systems are constantly gaining more interest, since they are the foundation for many challenging tasks in the field of computer vision. This thesis proposes a hardware design based on the object detection system of Viola and Jones using Haar-like features. The proposed design is able to detect faces in real time with high accuracy. Speedup is achieved by exploiting the parallelism in the design, where multiple classifier cores can be added to the system. To maintain a flexible design, classifier cores can be assigned to different images. Moreover, using different training data, every core is able to detect a different object. The development platform is the Zynq-7000 All Programmable SoC by Xilinx, which features an ARM Cortex-A9 dual-core CPU and programmable logic similar to that of an FPGA. The current implementation focuses on face detection and achieves real-time detection at 16.53 FPS at an image resolution of 640x480 pixels, which represents a speedup of 6.46 times compared to the equivalent OpenCV software solution.

## September 18th, 2014-10:30 to 11:00am: "Horizon 2020 Proposals: ACRO and VisionNet". Ahmed Elhossini.

Abstract of ACRO: Autonomous, Cooperative Robots for Off-highway applications. Machines for agricultural use are characterized by control functions, and are ideal candidates for automation through robotics technologies. This will not only dramatically reduce the number of required man-hours, but can also cut construction costs by up to 40% while improving the efficiency and precision of the actual farming work. The main objective of the ACRO proposal is to achieve safe and autonomous yet cooperative operation for agricultural vehicles, through sensor fusion between vehicles via safe and secure wireless links. Many automatic driving routines are already implemented, but the overall missing link is environmental awareness. The safety of humans and animals can only be guaranteed through the fusion of data from many different sensors on trusted vehicle devices, shared within a fleet over secure wireless links. This way, fault tolerance is guaranteed through redundant sensors. A remote operator will make the ultimate decision in case of a dangerous situation. The construction of safe-by-design intelligent vehicles, which cooperate to offer complex functions, will lead to extraordinary efficiency and precision.

Abstract of VisionNet: Cooperative Vision Networks Exploiting Adaptive Data Fusion. In this project we address the challenges and requirements for smart distributed vision systems. In traditional multi-camera networks the information (images, features, detection results...) flows one way and is fused in a central location at a fixed level of abstraction. Moreover, the cameras in the network tend to be similar. In contrast, the project aims to develop truly cooperative vision systems, with multi-way and dynamically adapted data flows over non-traditional wireless communication channels. This reduces communication overhead and facilitates smart "in camera" video processing, resulting in a scalable system. We also consider heterogeneous networks composed of battery-operated cameras of not only varying resolution but also varying spectral sensitivity.

## September 18th, 2014-10:00 to 10:30am: "Model-Checking Memory-related properties of HW/SW Codesigns". Marcel Pockrandt.

Hardware/Software codesign has gained more and more importance in recent years. It is widely used in industrial system design and especially in the field of safety-critical systems. For those systems, testing is insufficient. Although several approaches exist for the formal verification of HW/SW codesigns, the support for many important aspects is still immature.
In this presentation, an approach for the fully automatic transformation of SystemC/TLM designs into equivalent UPPAAL timed automata is presented. For this, we use a formal memory model, which enables the formal representation of a large variety of memory-related constructs and operations. Additionally, we automatically generate verification properties for the UPPAAL model. With this, we can use the UPPAAL model checker to ensure the absence of several common memory-related errors, including null pointer accesses, array out-of-bounds accesses and double frees.

## September 11th, 2014-10:00 to 10:30am: "Languages and tools for efficient parallel programming". Jeronimo Castrillon.

Embedded (heterogeneous) multi-processor systems are found almost everywhere today, from embedded high-performance computing and smartphones to deeply embedded systems. As a consequence, we have seen a proliferation of methods, tools and algorithms to better program these parallel systems. An example is the considerable effort invested in academia to devise solutions to support traditional sequential programming by automatically extracting parallelism. Despite great contributions in this area, it is clear that parallel programming models are needed in the long run to better leverage parallel computing. Dataflow programming is a good example of a successful explicit parallel model being used by many researchers in the embedded domain. These models are a good match for multimedia and signal processing applications for multi-cores. For many-cores, however, extensions to dataflow models or new paradigms are needed that allow parallelism to be expressed implicitly. This talk will address these issues and present the vision of the newly created chair for Compiler Construction in the context of the cfaed cluster of excellence at the TU Dresden.

Biography:

Jeronimo Castrillon received the Electronics Engineering degree with honors from the Pontificia Bolivariana University in Colombia in 2004, the master's degree from the ALaRI Institute in Switzerland in 2006 and the Ph.D. degree (Dr.-Ing.) in Electrical Engineering and Information Technology with honors from the RWTH Aachen University in Germany in 2013. From early 2009 to April 2013 Dr. Castrillon was the chief engineer of the chair for Software for Systems on Silicon at the RWTH Aachen University, where he had been a member of the research staff since late 2006. From April 2013 to April 2014 Dr. Castrillon was senior scientific staff in the same institution. In June 2014, Dr. Castrillon joined the department of computer science of the TU Dresden as professor for compiler construction in the context of the German excellence cluster “Center for Advancing Electronics Dresden” (CfAED). His research interests lie in methodologies, languages, tools and algorithms for programming complex computing systems.

## August 6th, 2014-10:00 to 10:30am: "Microarchitectural Latencies in GPU Throughput Processors". Michael Andersch.

Today's graphics processing units (GPUs) are fundamentally throughput-focused devices designed to hide microarchitectural latency through heavy use of thread-level parallelism. Over the last few generations of commercial GPUs, throughput has increased substantially as a result of both architectural innovations and advancements in manufacturing technology. At this point, there are two main classes of applications that are run on GPU devices: Those limited by throughput, and those limited by latency. Many well-known contemporary general-purpose GPU applications fall into the former category, as these are applications that map well to the GPU, usually yielding large performance benefits over CPU implementations. In the latter category, however, today's application landscape is rather sparse, as latency-limited applications are usually unable to achieve good performance on GPUs as a result of high instruction latencies the GPU is unable to hide. To this day, throughput-limited applications have been the drivers behind the GPU's architectural innovations. Thus, for throughput-limited applications, the architectural trends of the last years will likely continue to provide performance increases in the future. For latency-limited applications, however, good GPU performance is becoming more and more a distant hope, as the focus of GPU evolution has always been to improve the GPU's throughput processing capabilities. Indeed, GPU design today is so throughput-focused that the actual values of key microarchitectural latencies, such as pipeline depths or even cache hit latencies, are unknown for most commercially available GPUs. With this work, the blissful ignorance of microarchitectural latency among GPU architects is recognized and, consequently, mitigated by working towards a better understanding of latencies in GPUs through a multi-step GPU microarchitecture latency analysis. 
In the first step, the values of key latencies in multiple modern NVIDIA GPU microarchitectures are determined and interpreted. Afterwards, a microarchitecture performance simulator is employed to investigate how latency behaves in GPUs when executing diverse real-world programs and the relationship between latency and performance is discussed. Finally, the usefulness of the previous findings is demonstrated by the proposal of four enhancements to the microarchitecture of modern GPUs that significantly improve performance for memory-sensitive, latency-limited applications. Together, these enhancements are capable of improving mean application performance by 8.5% while never degrading performance for the used set of benchmarks.

## July 17th, 2014-10:30 to 11:00am: "Nexus#: Distributed Task Graph Managers for the OmpSs Programming Model". Tamer Dallou.

Task-based programming models, such as OmpSs, are gaining more importance with the ever increasing number of cores on chip. By adding simple annotations to the user code, they can exploit the available parallelism dynamically, regardless of how complex or irregular the dependency pattern is. This comes at the cost of runtime overhead for dynamic dependency resolution. Nexus# is a configurable hardware accelerator designed to overcome this overhead. It utilizes a distributed task graph model to track dependencies and decide the next ready task(s) to run. Using a set of synthetic benchmarks, experimental results show that Nexus# is more scalable than Nexus++, its centralised task graph manager predecessor.
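The dynamic dependency resolution that causes this overhead can be sketched in software. The toy class below derives dependencies from each task's declared inputs and outputs and reports which tasks become ready when another finishes. It is a single, centralised software illustration with hypothetical class and method names; it does not model the distributed hardware design of Nexus# or the OmpSs runtime API.

```python
from collections import defaultdict


class TaskGraph:
    """Toy task graph tracking read-after-write, write-after-read
    and write-after-write dependencies between submitted tasks."""

    def __init__(self):
        self.last_writer = {}                # address -> last task writing it
        self.readers = defaultdict(list)     # address -> readers since that write
        self.pending = {}                    # task -> unfinished predecessors
        self.successors = defaultdict(list)  # task -> tasks waiting on it
        self.finished = set()

    def _depend(self, task, pred):
        if pred is None or pred == task or pred in self.finished:
            return
        if task not in self.successors[pred]:
            self.successors[pred].append(task)
            self.pending[task] += 1

    def submit(self, task, inputs=(), outputs=()):
        """Register a task; return True if it can run immediately."""
        self.pending[task] = 0
        for addr in inputs:                   # read-after-write
            self._depend(task, self.last_writer.get(addr))
        for addr in outputs:                  # write-after-write, write-after-read
            self._depend(task, self.last_writer.get(addr))
            for reader in self.readers[addr]:
                self._depend(task, reader)
        for addr in inputs:
            self.readers[addr].append(task)
        for addr in outputs:
            self.last_writer[addr] = task
            self.readers[addr] = []
        return self.pending[task] == 0

    def finish(self, task):
        """Mark a task finished; return the tasks that just became ready."""
        self.finished.add(task)
        ready = []
        for succ in self.successors.pop(task, []):
            self.pending[succ] -= 1
            if self.pending[succ] == 0:
                ready.append(succ)
        return ready


graph = TaskGraph()
print(graph.submit("A", outputs=["x"]))                # True: no predecessors
print(graph.submit("B", inputs=["x"], outputs=["y"]))  # False: waits for A
print(graph.submit("C", inputs=["x"], outputs=["z"]))  # False: waits for A
print(graph.submit("D", outputs=["x"]))                # False: waits for A, B, C
print(graph.finish("A"))                               # ['B', 'C']
```

Every `submit` and `finish` call walks shared bookkeeping structures, which hints at why a software runtime becomes a bottleneck for fine-grained tasks and why hardware support is attractive.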

## July 17th, 2014-10:00 to 10:30am: "Integration of an LSU with Parallel Read Access into a VLIW Architecture". Tobias Huesgen.

VLIW architectures are able to execute multiple operations simultaneously. However, parallel execution is limited by the obtainable ILP of the application and the available resources of the architecture. In particular, there is usually only one resource available to access memory, yet parallelization can often be exploited only if the corresponding demand for data can be satisfied.

This thesis presents an extension to the ρ-VEX processor that allows parallel loading of multiple words in one instruction. The basis forms an existing LSU that will be integrated using VHDL. All steps including selection of the necessary instruction set extension, integration work and evaluation of the implementation will be documented. As shown, the proposed λ-VEX-Design allows a speedup up to 45% for the chosen use cases, while only 13% more ressources on a FPGA are needed.

## July 3rd, 2014- 10:30-11:00am - GPGPU Workload Characteristics and Performance Analysis - Sohan Lal.

GPUs are much more power-efficient devices compared to CPUs, but due to several performance bottlenecks, the performance per watt of GPUs is often much lower than what could be achieved theoretically. To sustain and continue high performance computing growth, new architectural and application techniques are required to create power-efficient computing systems. To find such techniques, however, it is necessary to study the power consumption at a detailed level and understand the bottlenecks which cause low performance. Therefore, in this paper, we study GPU power consumption at component level and investigate the bottlenecks that cause low performance and low energy efficiency. We divide the low performance kernels into low occupancy and full occupancy categories. For the low occupancy category, we study if increasing the occupancy helps in increasing performance and energy efficiency. For the full occupancy category, we investigate if these kernels are limited by memory bandwidth, coalescing efficiency, or SIMD utilization.

## July 3rd, 2014-10:00-10:30am -An Integrated Hardware-Software Approach to Task Graph Management- Nina Engelhardt.

Task-based parallel programming models with explicit data dependencies, such as OmpSs, are gaining popularity, due to the ease of describing parallel algorithms with complex and irregular dependency patterns. These advantages, however, come at a steep cost of runtime overhead incurred by dynamic dependency resolution. Hardware support for task management has been proposed in previous work as a possible solution. We present VSs, a runtime library for the OmpSs programming model that integrates the Nexus++ hardware task manager, and evaluate the performance of the VSs-Nexus++ system. Experimental results show that applications with fine-grain tasks can achieve speedups of up to 3.4x, while applications optimized for current runtimes attain 1.3x. Providing support for hardware task managers in runtime libraries is therefore a viable approach to improve the performance of OmpSs applications.

## June 26th, 2014-10:00-10:30am -Single-Graph Multiple Flows: Energy Efficient Design Alternative for GPGPUs- Dani Voitsechov.

The single-graph multiple-flows (SGMF) architecture combines coarse-grain reconfigurable computing with dynamic dataflow to deliver massive thread-level parallelism. The CUDA-compatible SGMF architecture is positioned as an energy efficient design alternative for GPGPUs.

The architecture maps a compute kernel, represented as a dataflow graph, onto a coarse-grain reconfigurable fabric composed of a grid of interconnected functional units. Each unit dynamically schedules instances of the same static instruction originating from different CUDA threads. The dynamically scheduled functional units enable streaming the data of multiple threads (or graph flows, in SGMF parlance) through the grid. The combination of statically mapped instructions and direct communication between functional units obviates the need for a full instruction pipeline and a centralized register file, whose energy overheads burden GPGPUs.

The SGMF architecture delivers performance comparable to that of contemporary GPGPUs while consuming ~50% less energy on average.

## June 5th, 2014- 10:30-11:00am - FPGA and Emulation Research at Intel Labs - Angela Pohl.

Today, big box and FPGA-based emulation systems are commonly used as validation and development vehicles at Intel. This presentation will talk about their development over the past years, highlight the unique demands of emulating product-ready processors, and present the related research conducted by the groups within Intel Labs.

## June 5th, 2014-10:00 to 10:30am- Sparkk: Quality-Scalable Approximate Storage in DRAM - Jan Lucas.

DRAM stores its contents in leaky cells that require periodic refresh to prevent data loss. The refresh operation not only degrades system performance, but also consumes significant amounts of energy in mobile systems. Relaxed DRAM refresh has been proposed as one possible building block of approximate computing. Multiple authors have suggested techniques where programmers can specify which data is critical and cannot tolerate any bit errors, and which data can be stored approximately. However, in these approaches all bits in the approximate area are treated as equally important. We show that this produces suboptimal results and that higher energy savings or better quality can be achieved if a more fine-grained approach is used. Our proposal is able to save more refresh power and enables more effective storage of non-critical data by utilizing a non-uniform refresh of multiple DRAM chips and a permutation of the bits to the DRAM chips. In our proposal, bits of high importance are stored in high-quality storage bits and bits of low importance are stored in low-quality storage bits. The proposed technique works with commodity DRAMs.
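
The idea of matching bit importance to storage quality can be sketched as follows. This is only an illustration under stated assumptions: the per-chip bit error rates are made up, one bit of each value is assumed to be stored per chip, and the weighted sum used as an error proxy is not the quality metric of the talk. The function names are hypothetical.

```python
def assign_bits_to_chips(chip_error_rates):
    """Return chip_for_bit, where bit i (0 = LSB) is stored on chip chip_for_bit[i].

    The least significant bit goes to the least reliable chip and the most
    significant bit to the most reliable one.
    """
    return sorted(range(len(chip_error_rates)),
                  key=lambda chip: chip_error_rates[chip],
                  reverse=True)


def weighted_error(chip_error_rates, chip_for_bit):
    """Error proxy: sum over bits of (flip probability) * (bit weight 2**i)."""
    return sum(chip_error_rates[chip] * (1 << bit)
               for bit, chip in enumerate(chip_for_bit))


# Hypothetical error rates of four chips that are refreshed at different rates.
rates = [1e-3, 1e-6, 1e-4, 1e-2]
importance_aware = assign_bits_to_chips(rates)
naive = list(range(len(rates)))  # bit i simply stored on chip i
```

With these assumed rates, the importance-aware mapping places the most significant bit on the chip with the lowest error rate, so `weighted_error(rates, importance_aware)` is smaller than `weighted_error(rates, naive)`.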

## May 15th, 2014-10:00 to 10:30: Proximity Coherence for Instruction Memory in Tiled CMP Architectures- Tareq Alawneh.

Many-core architectures have emerged as a result of continued growth in transistor density and the high complexity of core design. With continuing increase in the number of integrated processor cores on a single die and the size of the caches in CMP architectures, the area required by the shared-bus and the high access latency of a large cache are becoming a critical part of future CMP architectures. Therefore, the tiled CMP architectures are introduced as the best choice for the future scalable CMPs.

An important challenge for many-core architectures is maintaining cache coherence in an efficient manner. The intention of existing cache coherence protocols is to create a single and consistent view of memory for the executed threads at any given point in time, in spite of the presence of private per-core caches. There are two dominant classes of protocols for maintaining cache coherence in parallel computing systems: snoopy-based and directory-based protocols. Directory-based protocols have been proposed as an alternative solution to avoid the bandwidth overhead and reduce the power consumption of snoopy-based protocols. Unfortunately, traditional directory protocols introduce an indirection to obtain the coherence information from the directory, which adds hops to the critical path of cache misses and thereby adds extra latency. In my presentation, I will talk about a scheme to reduce the impact of this problem on performance.

## April 3rd, 2014- 10:30am: Spin Digital, Mauricio Álvarez-Mesa.

Spin Digital is a new company being created as a spin-off from the AES group of TU Berlin. The main focus of Spin Digital is to make high quality video affordable. The technology provided by Spin Digital makes it possible to improve the quality of digital video and at the same time reduce its size (bitrate), while keeping the complexity under control. By using advanced video codecs (e.g. HEVC/H.265) and exploiting the parallelism found in the latest generation of microprocessors, Spin Digital software enables new types of applications such as Ultra High Definition (UHD) TV and improves the Quality of Experience for High Definition video over the Internet. Other applications that can benefit include online games, remote desktop systems, video conferencing, digital cinema, and video for industrial/medical applications. In this talk we will present the motivation and initial plan of the company.

## April 3rd, 2014-10:00am: Design and Validation of a heterogeneous Cluster-Node, Marcus Schubert

In this master thesis, a high-performance hardware design for embedding in a distributed cluster system is planned. Based on research into the current state of technology, a requirement profile for the cluster nodes is first derived and a functional specification is created. Later use as a research and development platform also gives rise to additional features that must be considered. The design of a suitable system structure and the selection of matching hardware components for the cluster nodes form the next design stage. The dimensioning of an adequate power supply and cooling system and the creation of an electronic schematic then determine the system properties of the cluster node. The planning of the board dimensions, the determination of the access points to the system, the component placement and the circuit wiring convert the design into a PCB layout. Finally, the functional properties are checked using several system validation techniques.

## Feb 27th, 2014: Polyphone Gitarrenakkorderkennung - Martin Schürer.

This bachelor thesis deals with a branch of automatic music transcription, namely the recognition of polyphonic guitar chords. Since the process of transcription is very time consuming, an automated solution eases the exchange of songs and melodies between musicians and supports teaching practice. This work presents an algorithm that analyses the polyphonic tones contained in a single-track guitar signal in order to determine the accompanying chords. The algorithm was implemented in MATLAB and is based on frequency analysis, so that a chord contained in a signal can be identified by spectral analysis. The analyzed note can be an individually played string or a chord with up to six simultaneously played strings. Furthermore, an onset detection was implemented, which recognizes the beginning and the end of a note. The algorithm is verified using synthetically generated and self-recorded chords. The aim of this algorithm is the correct detection of individually played strings, chords of minor complexity (with two to three simultaneously played strings) as well as more complex ones (up to six simultaneously played strings). Based on the test cases used, the accuracy of note recognition is approximately 94 percent. The proper detection of a single strummed string requires an average time of 104 ms. For simple chords, the algorithm requires 202 ms and for complex chords 244 ms on average. The onset detection is verified by a test riff and achieves a recognition rate of 93 percent.
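
One step that any frequency-based approach needs is mapping a spectral peak to a note. The sketch below shows the standard equal-temperament conversion; it is written in Python for illustration and is not the MATLAB implementation from the thesis. The function names and the assumption that spectral peaks are already available as a list of frequencies are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def frequency_to_note(freq_hz, a4_hz=440.0):
    """Map a frequency to the nearest equal-tempered note name, e.g. 'E2'."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))  # MIDI note 69 is A4
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def pitch_classes(peak_frequencies_hz):
    """Reduce detected peaks to the set of note names without octaves."""
    return {frequency_to_note(f)[:-1] for f in peak_frequencies_hz}


# Fundamental frequencies of an open E minor chord in standard tuning.
e_minor_peaks = [82.41, 123.47, 164.81, 196.00, 246.94, 329.63]
```

For the six fundamentals above, `pitch_classes` yields the three pitch classes E, G and B, which a chord table could then match to E minor. A real signal also contains harmonics and noise, which this sketch ignores.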

## Feb 13th, 2014-10:30am-Interfacing an Embedded Vision System to Embedded Linux on a Zynq FPGA - Konstantinos Gavriilidis.

Real-time stereo vision systems are deployed in several computer engineering applications, from intelligent robotics to automated systems. Stereo vision is a compute-intensive task, as it requires computing the correlation between a pair of stereo frames. In order to achieve a fast correlation calculation that provides acceptable results in real time, an efficient system design needs to be developed. In this thesis, a hardware-software co-design approach is presented, which makes use of the synergy of hardware and software. An existing real-time vision system was extended to calculate the depth of a recorded object. This vision system consists of a Zynq SoC which obtains the stream of a VmodCAM module. The work conducted for this thesis can be divided into three parts. First, a Linux driver was implemented to enhance the software capabilities of the Zynq. Then, the hardware configuration of the Zynq was extended to support stereo streaming. Finally, a stereo depth calculation algorithm, which makes use of the driver to obtain a stereo stream in real time, was implemented in software. The proposed platform design and algorithm were evaluated in terms of accuracy and performance, and several enhancements are proposed to improve both.

## Feb 13th, 2014-10:00am- Statisches Scheduling auf einem echtzeitbasiertem Multi-Core System- Mirko Liebender.

In this thesis we consider the possibilities for implementing a static scheduler on safety critical multi-core real-time systems. Static schedulers are mainly used in domains like automotive and avionics. Among other things they can be used to guarantee hard real-time on modern multi-core processors.

In the course of this thesis, a simple static scheduler for the operating system VxWorks 6.9 was implemented without changing the kernel itself. The static scheduler is therefore located one layer above the operating system's scheduler. The concept of this scheduler enables the use of any correct static scheduling plan without needing to adjust the code of the custom programs that it runs. In multiple experiments on a QorIQ P4080 machine, the functionality, correctness and possible overhead of the implemented static scheduler were measured and evaluated. Correct and punctual execution could be demonstrated, and a small overhead of the scheduling routine was detected. The reinitialization of each process after it finishes is, however, more or less unpredictable and disqualifies the static scheduler for safety-critical use without further insight into the operating system code. In this thesis, a working static scheduler for VxWorks 6.9 could be realized with a few limitations concerning predictability.

## Feb 10th, 2014: Mauricio Álvarez-Mesa gives an invited talk at University of Castilla-La-Mancha, Albacete, Spain.

Dr. Mauricio Álvarez-Mesa will give an invited talk at the Department of Computer Science of the University of Castilla-La-Mancha, in Albacete, Spain, on Monday, February 10th. The title of the talk is: "Parallel Video Decoding: Experiences with H.264 and HEVC". In this talk Dr. Álvarez-Mesa will present the latest results of the AES research group on video decoding using parallel architectures. Dr. Álvarez-Mesa will also meet with researchers from the Computer Architecture and Technology group at the Albacete Research Institute of Informatics (I3A) about possible joint projects.

## Jan. 16, 2014-10:40am- Power and Energy Efficiency of Video Decoding on Multi-core Architectures- Chi Ching Chi.

In this talk we present how modern power states influence the power consumption of realtime HEVC video decoding. In realtime applications a set amount of operations has to be performed within a time frame, allowing the processor to go idle when the task has been performed. Processor architectures and off-chip memory have incorporated many low-power states, which allow the processor to consume less energy at lower activity levels. On x86 processors this has resulted primarily in so-called P-states and C-states, which control the power consumption when active and idle, respectively. Each of these states has different transition times, power consumption, and performance levels, introducing a new problem of choosing when to use which state. In research, conflicting strategies such as "race to idle" and running longer at a lower clock frequency have each been proposed as the best solution. We evaluate which technique is better for HEVC decoding. The analysis has been performed on different systems ranging from desktop to ultra-mobile platforms.
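
The trade-off between the two strategies can be made concrete with a very simple energy model: energy per frame is active power times busy time plus idle power times the remaining time until the deadline. The sketch below uses this model with invented numbers; it ignores state transition times and is not based on measurements from the talk.

```python
def frame_energy(cycles, freq_hz, active_power_w, idle_power_w, deadline_s):
    """Energy in joules for one frame: run at freq_hz, then idle until the deadline."""
    busy_s = cycles / freq_hz
    if busy_s > deadline_s:
        raise ValueError("frame misses its deadline at this frequency")
    return active_power_w * busy_s + idle_power_w * (deadline_s - busy_s)


CYCLES = 20e6        # assumed work per frame
DEADLINE = 1 / 25    # 25 frames per second

# Scenario A (hypothetical): high power at the fast clock, shallow idle state.
race_a = frame_energy(CYCLES, 2e9, 20.0, 1.0, DEADLINE)
slow_a = frame_energy(CYCLES, 1e9, 6.0, 1.0, DEADLINE)

# Scenario B (hypothetical): modest power at the fast clock, deep idle state.
race_b = frame_energy(CYCLES, 2e9, 12.0, 0.1, DEADLINE)
slow_b = frame_energy(CYCLES, 1e9, 8.0, 0.1, DEADLINE)
```

Under the assumed numbers, running slower wins in scenario A and racing to idle wins in scenario B, which illustrates why neither strategy is best in general and why the choice has to be evaluated per platform.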

## Jan. 16, 2014-10:20am- DART: A Decoupled Architecture Exploiting Temporal SIMD- Jan Lucas.

GPUs can offer very high performance and good energy efficiency on some applications. Many applications, however, do not perform well. The high area and energy efficiency is reached by grouping threads into groups called warps and running the threads of one warp in lockstep. This way, with only one instruction per warp fetched, decoded and issued, up to warp-length operations can be executed. In conventional GPU implementations, spatial SIMD units are used to execute warps. This results in underutilization of the execution units if the threads of one warp follow different control flow paths (branch divergence). This talk presents DART, a novel architecture for GPUs based on temporal SIMD. By using a temporal implementation of SIMD, it can offer better utilization of the execution units under branch divergence. The details of the DART architecture will be explained, and benchmark results comparing DART with conventional GPUs will be presented.
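
The utilization argument can be illustrated with a small model. It assumes that a spatial SIMD unit occupies all lanes for one cycle per warp instruction, while a temporal SIMD lane processes the threads of a warp one after another and can skip inactive threads. This is a simplification for illustration, not a description of the DART pipeline, and the function names are hypothetical.

```python
def spatial_simd_utilization(active_masks, warp_size):
    """Fraction of lane slots doing useful work when every instruction
    occupies all warp_size lanes for one cycle."""
    useful = sum(sum(mask) for mask in active_masks)
    return useful / (warp_size * len(active_masks))


def temporal_simd_utilization(active_masks):
    """Fraction of cycles doing useful work on one lane that executes the
    active threads of each instruction sequentially (assumption: an
    instruction with no active thread still costs one cycle)."""
    useful = sum(sum(mask) for mask in active_masks)
    cycles = sum(max(1, sum(mask)) for mask in active_masks)
    return useful / cycles


# One warp of four threads: a common instruction, then a divergent branch
# where thread 0 takes one path and threads 1 to 3 take the other.
masks = [
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]
```

In this example the spatial unit performs 8 useful operations in 12 lane slots, while the temporal lane performs the same 8 operations in 8 cycles.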

## Jan. 16, 2014-10:00am- Parallel H.264/AVC Motion Compensation for GPUs using OpenCL- Biao Wang.

Motion Compensation (MC) is one of the most compute-intensive parts of H.264/AVC video decoding. It exposes massive parallelism that can benefit from Graphics Processing Units (GPUs). However, the divergence caused by different interpolation modes in MC can lead to a significant performance penalty on GPUs. In this work, we propose a novel multi-stage approach to parallelize the MC kernel for GPUs using OpenCL. The proposed approach mitigates the divergence by exploiting the fact that different interpolation modes share common computation stages. In addition, the optimized kernel has been integrated into an ffmpeg decoder that supports H.264/AVC high profile. We evaluated our kernel on GPUs with different architectures shipped by AMD, Intel, and Nvidia. Compared to a CPU implementation, our kernel achieves maximum speedups of 3.27 and 3.59 for 1080p and 2160p videos, respectively. Furthermore, we applied zero-copy optimization for integrated GPUs from AMD and Intel to eliminate memory copy overhead between CPU and GPU.

## Jan. 2, 2014- AES publicizes their H264 OpenCL decoder.

To employ the power of GPUs for massively parallel processing, this work offloads parallel kernels in H.264 decoding, namely inverse transform and motion compensation, onto GPUs. At kernel level, a significant speedup is observed compared to a highly optimized CPU SIMD implementation.

## Dec. 19, 2013- 10:30h, EN 642: Performance portability for low-power embedded GPUs- Guilherme Calandrini.

GPUs are already a common part of embedded system design, and the OpenCL standard promises a vendor-neutral programming solution for CPUs, GPUs and DSPs. These advances are driven by portable devices such as smartphones and tablets, which have stimulated the emergence of several distinct low-power embedded GPU architectures. This presentation shows the results of an analysis of the performance and power efficiency of several GPU architectures using a suite of OpenCL micro-benchmarks. The results are key to understanding the performance and power implications of optimization strategies for the different GPU architectures, and serve as a guide for selecting appropriate GPUs according to performance or energy efficiency requirements.

## Dec. 19, 2013- 10h, EN 642: Device Tree-Erweiterung für QEMU- Tim Barnewski.

I will show how to add configurable hardware support using Device Trees for the Versatile Platform Baseboard family of single-board computers to QEMU. To achieve this, I will first provide a more detailed insight into the structure and purpose of Device Trees, as well as some of the inner workings of QEMU that are relevant to this goal. Additionally, an overview of the implementation itself will be provided, and it will cover any surprises, problems or other findings encountered during the process.

## Dec. 17, 2013: Visit to Technische Universität Dresden.

The PhD guest Guilherme Calandrini visited TU Dresden to give a presentation entitled "Performance Portability and Energy Issues in Computing Architectures" to the Operating Systems and Security group of Prof. Dr. Hermann Härtig in the Faculty of Computer Science. The group has a strong background in high-level system development, such as the Linux kernel and virtualization. The visit aimed to present the issues in developing energy-efficient applications that must deal with different layers of the computing architecture (from circuit level and architecture design to the operating system and even the virtual machine). During the visit, he also had the opportunity to introduce the LPGPU project to Prof. Dr. Emil Matus from the Vodafone Chair Mobile Communications Systems, who works on a heterogeneous SoC for communication applications.

For further information about the talk, see os.inf.tu-dresden.de/EZAG/abstracts/abstract_20131217.xml

## Dec 5, 2013- 10h, EN 642: Multiprotokollfähige Master für Ethernet-basierte Feldbusse - Victor Kozhukhov

Modern CNC controllers use special Ethernet-based protocols to drive slave devices in real time. A CNC controller from Schleicher Electronic that can already operate as a Sercos III master is extended with the functionality of an EtherCAT master. Both protocols are to be usable over the same Ethernet ports of the CNC controller. The user should be able to decide independently whether the CNC controller is used as a Sercos III master or as an EtherCAT master. Switching the protocol should require as little effort as possible. In particular, changes to the hardware (with the exception of the contents of the programmable logic) are to be avoided. The CNC controller uses a dual MAC optimized specifically for the Sercos III master. When the software starts, the dual MAC of the Sercos III master is loaded as an IP core onto an FPGA integrated into the CNC controller. To enable the CNC controller to be used as an EtherCAT master, a suitable dual MAC is developed. It can thus be decided at software start-up whether the dual MAC for the Sercos III master or the dual MAC for the EtherCAT master is loaded onto the FPGA. The dual PHY, which is integrated into the CNC controller and connected to the FPGA, is suitable for both protocols.

In addition, the software must be adapted so that the CNC controller can operate as an EtherCAT master. The EtherCAT master protocol stack is built on an EtherCAT master high-level driver from Acontis Technologies, which is provided as a precompiled library. The application-specific software can link against the high-level driver and use it through a corresponding API. To use the high-level driver together with the dual MAC developed independently for the EtherCAT master, an Ethernet hardware abstraction layer is implemented.

## Dec 3, 2013: Vice chancellor of Politecnico di Milano visits AES group.

On Tuesday, December 3, the vice chancellor of TU Berlin's partner university Politecnico di Milano, Prof. Donatella Sciuto, will visit the AES group. Prof. Sciuto is a full professor in Computer Engineering at the Dipartimento di Elettronica e Informazione of the Politecnico di Milano. She is Deputy Director of Education at CEFRIEL, where she manages the education and training programs for executives and companies. For more information about Prof. Sciuto and her research interests, visit her website.

## Dec 2, 2013, 13h: Crown Scheduling: Energy-Efficient Resource Allocation, Mapping and Discrete Frequency Scaling for Collections of Malleable Streaming Tasks- Prof. Dr. Christoph Kessler.

Time: 1:00 PM, 2 December 2013
Place: 4.064, MAR Building

Abstract:
We investigate the problem of generating energy-optimal code for a collection of streaming tasks that include parallelizable or malleable tasks on a generic manycore processor with dynamic discrete frequency scaling. Streaming task collections differ from classical task sets in that all tasks are running concurrently, so that cores typically run several tasks that are scheduled round-robin at user level in a data driven way. A stream of data flows through the tasks and intermediate results are forwarded on-chip to other tasks.
In this presentation we introduce Crown Scheduling, a novel technique for the combined optimization of resource allocation, mapping and discrete voltage/frequency scaling for malleable streaming task sets in order to optimize energy efficiency given a throughput constraint. We present optimal off-line algorithms for separate and integrated crown scheduling based on integer linear programming (ILP). Our energy model considers both static idle power and dynamic power consumption of the processor cores.
Our experimental evaluation of the ILP models for a generic manycore architecture shows that at least for small and medium sized task sets even the integrated variant of crown scheduling can be solved to optimality by a state-of-the-art ILP solver within a few seconds.
We conclude with a short outlook to the new EU FP7 project EXCESS (Execution Models for Energy-Efficient Computing Systems).

Acknowledgements:
This is joint work with Nicolas Melot (Linköping University), Patrick Eitschberger and Jörg Keller (FernUniv. in Hagen, Germany). Partly funded by VR, SeRC, and CUGS.
Based on our recent paper with the same title at Int. Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS-2013), Sep. 2013, Karlsruhe, Germany.

Short Biography:
Christoph W. Kessler (German spelling: Keßler) is a professor of Computer Science at Linköping University, Sweden, where he leads the Programming Environment Laboratory's research group on compiler technology and parallel computing. Christoph Kessler received a PhD degree in Computer Science in 1994 from the University of Saarbrücken, Germany, and a Habilitation degree in 2001 from the University of Trier, Germany.
In 2001 he joined Linköping University, Sweden, as associate professor at the programming environments lab (PELAB) of the computer science department (IDA).
In 2007 he was appointed full professor at Linköping University. His research interests include parallel programming, compiler technology, code generation, optimization algorithms, and software composition. He has published two books, several book chapters and more than 90 scientific papers in international journals and conferences. His contributions include the OPTIMIST retargetable optimizing integrated code generator for VLIW and DSP processors, the PARAMAT approach to pattern-based automatic parallelization, the concept of performance-aware parallel components for optimized composition, the PEPPHER component model and composition tool for heterogeneous multicore/manycore based systems, the SkePU library of tunable generic components for GPU-based systems, and the parallel programming languages Fork and NestStep.

## 27-28 Nov. 2013: Ben Juurlink gives keynote presentation at ICT.OPEN 2013.

Ben Juurlink has been invited to give a keynote presentation in the Embedded Systems track of ICT.OPEN 2013. ICT.OPEN is the principal ICT research conference in the Netherlands and is held on 27-28 November in Eindhoven. The title of his talk is "Lessons Learnt From Parallelizing Video Decoding". More information about the conference can be found at www.ictopen2013.nl/content/speakers.

## 21.11.13, 10h, EN 642: Manycore Agent-Oriented Programming (MAOP)- Silvano Menk and Robert Hering

In our presentation we want to give a short overview of our bachelor thesis. Therefore we will briefly discuss the current state of parallel programming with special focus on manycore architectures. From this we will deduce our idea for a supposedly intuitive and efficient programming model for manycore architectures, which will be the subject of our thesis. Finally we will propose a coarse working plan and hope for some initial feedback and suggestions.

## November 18-19 2013. Fusing GPU Kernels at HiPEAC Compiler, Architecture and Tools Conference.

A presentation based on research undertaken by Codeplay and TU Berlin’s AES group as part of the LPGPU project will be given at this year’s HiPEAC Compiler, Architecture and Tools Conference in Haifa, Israel. The talk is titled “Fusing GPU kernels within a novel single-source C++ API” and will be presented by Paul Keir from Codeplay.

More information about the conference can be found at: http://software.intel.com/en-us/articles/compilerconf2013

Abstract of the talk:

The prospect of GPU kernel fusion is often described in research papers as a standalone command-line tool. Such a tool adopts a usage pattern wherein a user isolates, or annotates, an ordered set of kernels. Given such OpenCL C kernels as input, the tool would output a single kernel, which performs similar calculations, hence minimising costly runtime intermediate load and store operations. Such a mode of operation is, however, a departure from normality for many developers, and is mainly of academic interest.

Automatic compiler-based kernel fusion could provide a vast improvement to the end-user's development experience. The OpenCL Host API, however, does not provide a means to specify opportunities for kernel fusion to the compiler. Ongoing and rapidly maturing compiler and runtime research, led by Codeplay within the LPGPU EU FP7 project, aims to provide a higher-level, single-source, industry-focused C++-based interface to OpenCL. Along with LPGPU's AES group from TU Berlin, we have now also investigated opportunities for kernel fusion within this new framework; utilising features from C++11 including lambda functions; variadic templates; and lazy evaluation using std::bind expressions.

While pixel-to-pixel transformations are interesting in this context, insomuch as they demonstrate the expressivity of this new single-source C++ framework, we also consider fusing transformations which utilise synchronisation within workgroups. Hence convolutions, utilising halos, and the use of the GPU's local shared memory are also explored.
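
For the pixel-to-pixel case, the effect of fusion can be shown with a language-neutral sketch. The Python below only illustrates the concept and does not reflect the C++ API or the OpenCL kernels discussed in the talk; the kernels `brighten` and `invert` and both helper functions are made up for this example.

```python
def run_unfused(kernels, pixels):
    """Apply each kernel in its own pass, materialising an intermediate
    buffer between passes (the loads and stores that fusion avoids)."""
    data = list(pixels)
    for kernel in kernels:
        data = [kernel(p) for p in data]
    return data


def fuse(*kernels):
    """Combine pixel-to-pixel kernels into one kernel that keeps the
    intermediate value local instead of writing it to a buffer."""
    def fused(p):
        for kernel in kernels:
            p = kernel(p)
        return p
    return fused


def brighten(p):
    return min(p + 10, 255)


def invert(p):
    return 255 - p


image = [0, 100, 250, 255]
fused_kernel = fuse(brighten, invert)
fused_result = [fused_kernel(p) for p in image]
```

The fused version computes the same result in a single pass over the image. Transformations that need synchronisation within a workgroup, such as convolutions with halos, cannot be fused by simple function composition like this.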

A perennial problem has therefore been restructured to accommodate a modern C++-based expression of kernel fusion. Kernel fusion thus becomes an integrated component of an extended C++ compiler and runtime.

## 7.11.13, 10:30h, EN 642: Automatic Code Generation for a Microblaze system with ARM NEON SIMD Acceleration - Ilias Timon Poulakis

SIMD (single instruction, multiple data) accelerators are increasingly deployed in modern CPU architectures. These units can efficiently process certain data, e.g. multimedia formats, improving CPU performance and energy consumption. The research department Embedded Systems Architecture (AES) of the Berlin Institute of Technology currently utilizes the Microblaze processor by XILINX, which does not support SIMD acceleration natively. Hence, an ARM NEON compatible SIMD accelerator has been attached to the Microblaze processor. The two units communicate through a protocol based on FSL (Fast Simplex Link). To use this peculiar architecture efficiently, automatic code generation is needed. Yet, creating a custom compiler is difficult and very time-consuming. To avoid this route, this thesis presents an alternative approach in which only existing compiler backends are used. The main idea is to create machine code for both Microblaze and ARM NEON separately, using their respective existing compiler backends. Code sections executable by ARM NEON have to be located and then appropriately inserted into the Microblaze code. In the course of this thesis, a tool that performs these tasks has been successfully implemented, tested and evaluated. This paper focuses on the realization steps taken. The capabilities of the implemented tool are discussed, and an outlook is given on how the approach could be utilized for a different combination of processor and SIMD accelerator.

## 7.11.13, 10h, EN 642: Instruction compression for the synZEN architecture- Tammo Johannes Herbert

This thesis focuses on compressing the instructions of the synZEN architecture. The synZEN architecture is a parallel architecture following the MIMD (multiple instruction, multiple data) principle. Parallelism is achieved at the instruction level (ILP, instruction-level parallelism) using the VLIW (Very Long Instruction Word) concept. A VLIW may contain NOPs (No Operation). The compression method presented here removes those NOPs so that they are not stored repeatedly in the instruction memory; instead, removable NOPs are stored once in a central place. Exploiting this redundancy reduces the average size of the VLIWs. This requires two substeps. First, the opcode needs to be compressed at compile time. This step is implemented purely in software. Second, the hardware needs to be altered accordingly for decompression, which happens at runtime. The resource utilization of the hardware therefore has to be weighed against the degree of compression achieved in software. The concluding evaluation describes the balance required for the highest possible effectiveness.
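
To illustrate the general idea of removing NOPs from VLIWs, the sketch below uses a common textbook scheme: each bundle is stored as a presence bit mask plus only its non-NOP operations, and the decoder re-inserts the NOPs. This swapped-in scheme is an assumption for illustration; the abstract does not specify the synZEN encoding, and the function names and the string representation of operations are hypothetical.

```python
NOP = "nop"


def compress(bundle):
    """Compile-time step: return (mask, ops), where bit i of mask is set if
    slot i holds a real operation and ops lists those operations in order."""
    mask = 0
    ops = []
    for slot, op in enumerate(bundle):
        if op != NOP:
            mask |= 1 << slot
            ops.append(op)
    return mask, ops


def decompress(mask, ops, width):
    """Runtime step (hardware in a real design): rebuild the full bundle."""
    remaining = iter(ops)
    return [next(remaining) if (mask >> slot) & 1 else NOP
            for slot in range(width)]


bundle = ["add r1, r2, r3", NOP, NOP, "ld r4, 0(r5)"]
mask, ops = compress(bundle)
restored = decompress(mask, ops, width=len(bundle))
```

Here a four-slot bundle shrinks to two operations plus a four-bit mask. The decompression logic is the part that costs hardware resources, which is the balance the abstract refers to.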

## 29–30 October 2013: Ben Juurlink @ Cyber-Physical Systems: Uplifting Europe's Innovation Capacity.

Ben Juurlink is currently visiting this two-day event in Brussels, which is devoted to exploring the innovation potential of Cyber-Physical Systems (CPS). The event is organized by the European Commission and discusses how EU Research and Innovation Programmes can stimulate the creation of new industrial platforms led by EU actors and facilitate the matchmaking between future user/customer needs and technology offers. For more information, see http://www.amiando.com/cps-conference.html.

## October 14, 2013: Prof. Juurlink is a member of the PhD defense committee of Yifan He.

Prof. Juurlink is visiting Eindhoven, NL, where he is a member of the PhD defense committee of Yifan He.

Yifan He defends his dissertation entitled "Low Power Architectures for Streaming Applications".

## 24.10.13, 11h, EN 642: Design and Implementation of a high-throughput CABAC Hardware Accelerator for the HEVC Decoder- Philipp Habermann.

HEVC is the new video coding standard of the Joint Collaborative Team on Video Coding. As in its predecessor H.264/AVC, Context-based Adaptive Binary Arithmetic Coding (CABAC) is a throughput bottleneck. Due to strong low-level data dependencies, there is only a very small amount of data-level parallelism that can be exploited by using the SIMD extensions of current computer architectures. A high-level parallelization is possible in HEVC, but not mandatory. That is why another optimization strategy has to be developed that can be used independently of the input video. To address this issue, attention was paid to throughput improvements during the standardization of HEVC. The goal of this thesis is to evaluate the hardware acceleration opportunities for the highly sequential HEVC CABAC by exploiting these throughput improvements. The evaluation is limited to transform coefficient decoding, as it is the most time-consuming part of CABAC. The hardware accelerator is implemented on the Digilent ZedBoard, a development board that contains a 667 MHz ARM Cortex-A9 processor together with a closely coupled FPGA and thereby allows efficient hardware-software co-design. The implemented hardware accelerator processes 70 Mbins/s at 75.36 MHz and achieves an 11× speed-up over software transform coefficient decoding for a typical workload. The hardware accelerator has also been integrated into a complete HEVC software decoder, but due to the current slow hardware-software interface, the overall speed-up is relatively small. However, as the data transfer between hardware and software can be significantly reduced when a full CABAC hardware accelerator is implemented, this is a promising path to pursue in future work.

## 24.10.13, 10h, EN 642: Design and Implementation of a Hardware Accelerator for HEVC Motion Compensation- Matthias Goebel.

This master thesis focuses on the design and implementation of a motion compensation hardware accelerator for use in HEVC hybrid decoders, i.e. decoders that contain hardware as well as software parts. The motion compensation part of the decoding process is especially suited for such an approach as it is the most time-consuming part of pure software decoders. The hardware accelerator should combine support for high resolutions and frame rates with a very low demand for resources and power. An optimized software decoder compatible with the reference decoder has been used as a starting point. The platform is the Zynq-7000 All Programmable SoC by Xilinx, which combines an ARM Cortex-A9 dual-core CPU running at 667 MHz with flexible programmable logic resources similar to those used in FPGAs. After some background information on the topics involved, the design space is discussed with a special focus on the level of granularity, the degree of parallelization and the memory access. For the granularity, the PU level has been chosen as it offers a good trade-off between performance and complexity. The resulting design is then described in more detail, and a prototype is implemented and validated. For validation, the Foreign Language Interface (FLI) of Mentor Graphics' ModelSim HDL simulator has been used. As an evaluation of the prototype shows promising results, two different memory interfaces (including one using DMA) are added and the complete accelerator is integrated into a Zynq-7000 environment. The necessary modifications to the software decoder for both interfaces are discussed and partially performed. A final evaluation shows an expected frame rate of 4.14 FPS for the complete 1080p decoding process when running the accelerator at 100 MHz.

## AES in California: An NVIDIAn returns.

Tuesday, 01. October 2013

As of October 1st, the graduate student Michael Andersch has re-joined AES research. Michael had spent his summer in Santa Clara, California, where he was employed by NVIDIA during a summer internship. As an intern, Michael worked in the GPU Compute Architecture team, building tools and architecture designs to analyze and improve compute application performance on NVIDIA's next-generation GPU designs. Welcome back, Michael!

## Our new employee Philipp Habermann.

Monday, 28. October 2013

We are pleased to welcome Philipp Habermann as a new member of our group. He will contribute to the AES team in research and teaching. Welcome!

## "Recent Advances in Computer Architecture" will take place in room EN 630!

Attention! Room Change! The course "Recent Advances in Computer Architecture" (0433 L 334) will take place on Tuesdays from 10:00 to 12:00 in room EN 630.

## The Lab exercises for Multicore Architectures will take place in room TEL 206 Li!

Attention! Room Change!

The Lab exercises for the Multicore Architectures course (LV 0433 L 333) will take place on Mondays from 14:00 to 16:00 in room TEL 206 Li.

## The courses Computer Arithmetics, Multicore Architectures and Recent Advances in Computer Architecture start a week later.

The following courses will start a week later (from October 21, 2013):

- Computer Arithmetics: A Circuit Perspective,

- Multicore Architecture and

- Recent Advances in Computer Architecture.

## 10.10.13, 10h, EN 642: Enhancing Cache Organization for Tiled CMP Architectures - Tareq Alawneh.

Many-core processor architectures have become very common, with the leading CPU manufacturers (Intel, AMD, and Tilera) focusing on tiled CMP architectures. Our target system is a tiled CMP architecture consisting of n cores interconnected by a 2D mesh switched network. Each tile has a processor core, private L1 data and instruction caches, a private L2 cache, and a router for on-chip data transfers. Each cache block has a home tile that maintains the directory information for that block; the directory keeps track of the tiles holding copies of that block. On a miss in both the private L1 and L2 caches, a tile requests the block from its home tile. The miss is then handled according to the specific coherence protocol implementation. The drawback of this design is that some home tiles may be overloaded with remote requests, which creates a scalability bottleneck. Furthermore, as the processor count increases, the L2 miss latency will be dominated by the number of message hops needed to reach the particular cache rather than by the time spent accessing the cache itself. These drawbacks can be mitigated by taking other access patterns of the data into account.

In this study, we analyze this problem and propose ways to alleviate its impact on system performance. One way to improve the performance of the tiled CMP architecture is to access the L2 cache banks of adjacent tiles to fetch the requested code cache lines before accessing their assigned home tiles. Such a mechanism reduces the remote L2 cache latency, since the requested code cache lines may be fetched from the L2 caches of nearby tiles instead of those of their home tiles. Furthermore, the number of accesses to the home tiles is reduced. We expect these two contributions to improve system performance through reduced network utilization and a lower average memory access time (AMAT).
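The hop-count argument can be made concrete with a small sketch. It assumes row-major tile numbering, dimension-ordered routing (so the hop count equals the Manhattan distance), and a simple address-interleaved home tile mapping; all three are illustrative assumptions and are not taken from the study.

```python
def tile_coords(tile, mesh_width):
    """Map a row-major tile index to (x, y) coordinates in the 2D mesh."""
    return tile % mesh_width, tile // mesh_width


def hops(src, dst, mesh_width):
    """Number of router-to-router hops between two tiles (Manhattan distance)."""
    sx, sy = tile_coords(src, mesh_width)
    dx, dy = tile_coords(dst, mesh_width)
    return abs(sx - dx) + abs(sy - dy)


def home_tile(block_addr, n_tiles):
    """Hypothetical home mapping: interleave cache blocks across all tiles."""
    return block_addr % n_tiles


# 4x4 mesh: a request from tile 0 for a block whose home is tile 15
# crosses the whole chip, while an adjacent tile is a single hop away.
far = hops(0, home_tile(15, 16), mesh_width=4)
near = hops(0, 1, mesh_width=4)
```

Under these assumptions the worst-case distance grows with the mesh dimensions, whereas a lookup in an adjacent tile always costs one hop, which is the motivation for probing neighbouring L2 banks first.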

As future work, we propose another way to improve the tiled CMP architectures by migrating hot cache lines closer to requesting tiles.

## October, 6-10: Prof. Juurlink to ICCD conference.

From October 6 to October 10 Prof. Juurlink will visit the IEEE International Conference on Computer Design in Asheville, North Carolina, USA.

He is the chairman of the Processor Architecture track and will also chair the session on Efficient Cache Architectures.

## Sept. 30th 2013. A delegation from Hunan University, China visits AES TU-Berlin.

A delegation from Hunan University (one of the oldest and most important national universities in China) will visit the AES group of TU-Berlin on September 30th. They will be introduced to the research activities of the AES group and discuss opportunities for joint research work. The delegation is composed of 7 faculty members from the School of Computer and Communication, led by Professor Renfa Li.

## 13.09.13: AES TU Berlin presents 4k UHD HEVC/H.265 decoding.

The AES group is proud to present its highly efficient 4k Ultra HD capable MPEG-HEVC/H.265 decoder setup. The demo setup consists of a 65-inch Samsung UHD TV and a custom mini PC based on a 4th generation Intel Core processor. Optimizations for the latest processor generation allow the compact setup to decode UHD at more than 60 fps, even at higher bit depths, with no more than two threads.

## 26.09.2013, 10h, EN 642: A Cost-Effective Kite State Estimator for Reliable Automatic Control of Kites - Johannes Peschel

Airborne Wind Energy (AWE) is a developing technology that uses tethered wings to harvest wind energy and convert it into electrical energy. Most of the AWE concepts presented in this thesis share one challenge: estimating the position and orientation of the kite, also called the kite state, especially during highly dynamic flight situations. The focus of this thesis is, first, to investigate whether angular sensors are a feasible way to obtain reliable position data and, second, to determine which fusion algorithm can be used to combine the data of the angular sensors and the Global Navigation Satellite System (GNSS) sensors. The TU Delft prototype is a suitable testing platform for this purpose. The author added angular sensors to the ground station of the TU Delft AWE system that measure the elevation and the horizontal displacement of the tether holding the kite. They are mounted on a modular stainless steel construction, which has low wear and a long lifetime. The author used the tether length and the angular data to obtain a new position. This position was merged with the positions from the two GNSS sensors that were already attached to the kite. The angular sensors were able to measure with a resolution of <0.01°. The elevation and azimuth position of the kite had an error of less than 0.7° as long as the tether force was higher than 2000 N. One of the GNSS sensors provided reliable data during low-force phases. A reliable position in all flight conditions could be obtained by using double exponential smoothing prediction to merge both positions. This development enables the implementation of a reliable kite power control system.
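As a rough illustration of double exponential smoothing prediction, the sketch below implements Brown's single-parameter variant on a one-dimensional signal. The abstract does not state which variant the thesis uses or how the two position sources are weighted, so the function name, the parameter `alpha` and the sample data are assumptions made for this example only.

```python
def des_forecast(samples, alpha, steps_ahead=1):
    """Brown's double exponential smoothing forecast for a 1-D signal.

    samples     -- measured values, oldest first (at least one value)
    alpha       -- smoothing factor, strictly between 0 and 1
    steps_ahead -- how many sample periods to predict into the future
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    s1 = s2 = samples[0]  # initialise both smoothed series with the first sample
    for x in samples[1:]:
        s1 = alpha * x + (1.0 - alpha) * s1   # first smoothing pass
        s2 = alpha * s1 + (1.0 - alpha) * s2  # second pass over the first
    level = 2.0 * s1 - s2                        # lag-corrected current value
    trend = alpha / (1.0 - alpha) * (s1 - s2)    # estimated slope per sample
    return level + steps_ahead * trend


# Hypothetical elevation angle rising by 2 degrees per sample.
elevation = [2.0 * t for t in range(60)]
predicted = des_forecast(elevation, alpha=0.5, steps_ahead=1)
```

For a signal with a steady slope, the forecast follows that slope rather than lagging behind it, which is the property that makes this kind of predictor attractive for bridging phases in which one sensor is unreliable.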

## Sept. 22-27, 2013: Prof. Juurlink to ScalPerf workshop.

Prof. Juurlink has been invited to give a presentation at the ScalPerf (Scalable Approaches to High Performance and High Productivity Computing) workshop which will be held in Bertinoro, Italy from Sept. 22 to Sept. 27, 2013. There he will present his recent article "Amdahl's law for predicting the future of multicores considered harmful". For more information about the workshop see http://www.dei.unipd.it/~versacif/scalperf13/index.html. The article can be accessed via ACM Digital Library http://doi.acm.org/10.1145/2234336.2234338.

## September 15, 2013: "HiPEAC grant: Performance portability for low-power embedded GPUs"

The AES group of TU Berlin has received a collaboration grant from HiPEAC for a three-month visit of Guilherme Calandrini, a PhD student from the University of Alcala in Spain. The visit will focus on performance portability for low-power embedded GPUs using OpenCL. In this collaboration we aim to create a set of OpenCL benchmarks that can be used to compare the performance and power efficiency of different embedded low-power GPUs. The results of this research will be very useful for understanding the performance and power implications of optimization strategies for different GPU architectures, and for selecting the most appropriate GPUs based on well-defined quantitative performance and power metrics.

## September 9, 2013: The AES group will host Mr. Hasan Hassan.

The AES group will host Mr. Hasan Hassan, a student at TOBB University of Economics and Technology (http://etu.edu.tr/en), as an intern to work on porting computer vision algorithms to GPUs using OpenCL. The internship is organized within the framework of the EU Erasmus Programme and will take place between 09-09-2013 and 20-12-2013. We aim to port several kernels that are used in various computer vision algorithms and have a high demand for parallel computation to GPUs, which provide a high level of parallel processing.

## September 7, 2013: Prof. Dr. Ben Juurlink in the MuCoCoS-2013

The paper "Topology-aware Equipartitioning with Coscheduling on Multicore Systems" by Jan H. Schönherr, Ben Juurlink and Jan Richling will be presented at the 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS-2013), which will be held on September 7 in Edinburgh, Scotland, UK, in conjunction with the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT 2013).

MuCoCoS-2013 focuses on language level, system software and architectural solutions for performance portability across different architectures and for automated performance tuning.

## July 2013: "best-in-class-award" to Philipp Habermann.

Recently the "best-in-class-award" was handed out by Prof. Juurlink to Philipp Habermann for the class "Advanced Computer Architecture 2011-2012". The "best-in-class-award" is awarded to the student who achieves the highest grade in Prof. Juurlink's master courses. This year it will also be awarded.

## 14-20 July, 2013: Sohan Lal and Jan Lucas from TU Berlin are going to present two posters at HiPEAC ACACES 2013.

Sohan Lal and Jan Lucas from TU Berlin are going to present two posters at HiPEAC ACACES 2013. The two posters will present some of the most recent research results from the project to the public for the first time.

• The poster “Exploring GPGPUs Workload Characteristics and Power Consumption” by Lal et al. will provide interesting insights into the power consumption of GPU workloads and how they are related to the performance characteristics of the workloads.
• The poster “DART: A GPU architecture exploiting temporal SIMD for divergent workloads” by Lucas et al. will present first simulation results for DART, a new GPU architecture developed within the LPGPU consortium.

## 8.07.2013, 13h, EN 642: Implementation and Evaluation of Large Warps in GPUSimPow - Matthias Stroux

Graphics processors (GPUs) are a special class of parallel processors for massively parallel programs. From add-on processors that accelerate graphics-intensive programs, they have developed into general-purpose computing devices for high-performance business and scientific computing. GPUs typically handle branches by serializing the branch paths, which leads to underutilization of their SIMD execution units. Large-Warps (LW) is a concept to increase utilization by selecting threads with the same execution path (PC and program state) from larger units, called 'Large-Warps', into temporary units of execution, the 'Sub-Warps'. For some programs, LW should therefore lead to a significant increase in SIMD utilization. Theoretical considerations also show that for some programs an increase in IPC, or a decrease in execution cycles, by a factor of more than two should be possible. To test this concept with real programs and in a complete system, where memory latency and network effects can be taken into account and measured, Large-Warps was implemented in GPGPU-Sim 3.x, a software simulator for GPUs, and in the new power simulator GPUSimPow for power effects. Results for ideally constructed synthetic benchmarks show the expected effects: where the utilization of the SIMD units can be increased, IPC increases too. However, memory effects and effects of other system parts have to be taken into account as well. For a number of 'real-world' benchmarks, the positive effect of Large-Warps on performance (IPC) can be confirmed.
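The sub-warp selection step can be sketched in a few lines. This is a simplified model that groups the threads of one large warp by their program counter and packs each group into sub-warps of SIMD width; it ignores register-file lane constraints and scheduling details that a real implementation such as the one in GPGPU-Sim has to respect, and all names are hypothetical.

```python
from collections import defaultdict


def form_sub_warps(thread_pcs, simd_width):
    """Pack threads that share a PC into sub-warps of at most simd_width threads.

    thread_pcs -- list where entry i is the current PC of thread i
    Returns a list of (pc, [thread ids]) tuples, one per sub-warp.
    """
    by_pc = defaultdict(list)
    for tid, pc in enumerate(thread_pcs):
        by_pc[pc].append(tid)
    sub_warps = []
    for pc in sorted(by_pc):
        tids = by_pc[pc]
        for start in range(0, len(tids), simd_width):
            sub_warps.append((pc, tids[start:start + simd_width]))
    return sub_warps


def simd_utilization(sub_warps, simd_width):
    """Fraction of SIMD lanes doing useful work over all issued sub-warps."""
    active = sum(len(tids) for _, tids in sub_warps)
    return active / (len(sub_warps) * simd_width)


# Eight threads that alternate between two branch targets.
pcs = [0x10, 0x20] * 4
packed = form_sub_warps(pcs, simd_width=4)
util = simd_utilization(packed, simd_width=4)
```

In this example the eight threads form two fully occupied sub-warps, whereas two conventional four-thread warps would each have to execute both branch paths with only half of their lanes active.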

## June 12th 2013: A group of computer engineering students from Universidad del Valle (Colombia) visit AES TU-Berlin

A group of computer engineering students, accompanied by Professor Dr. Maria Trujillo from Universidad del Valle in the Colombian city of Cali, will visit TU-Berlin on June 12th 2013. The German Academic Exchange Service (DAAD) has organized and financed this visit, which will allow the students to get to know the research and teaching activities of the AES-TUB group and will also be the starting point for future research collaborations.

## June 12th 2013, 10 a.m., EN 185: Achronix tech talk

The outline of the talk is:

• Overview of the FPGA market;
• Achronix in the High End FPGA market;
• Achronix Value Proposition versus incumbent high end FPGA vendors;
• INTEL / Achronix partnership : Current 22 nm Tri-gate process product, 14 nm products 2014/201, 10 nm;
• Products available at Achronix in 2nd H 2013;
• SW tools presentation (video);
• HD1000 demo board.

## SoSe 13 - A New Course: Computer Arithmetic: A Circuit Perspective

The advance of modern embedded systems and their high computation capabilities depend mainly on their ability to perform arithmetic operations efficiently. This course is intended to deepen the students' knowledge of the design of embedded arithmetic circuits and of the scientific background of these circuits. This will help the students to learn more about the design of arithmetic processing units and to gain more practical experience in the implementation of digital systems. The students will increase their experience in using hardware description languages to model and implement digital systems. The implementation of these circuits using VHDL/FPGA will be included as well.

## 12.04.2013, 10h, EN 642: Bachelor Thesis: High-Throughput Communication Interface for the Xilinx XUPV5 Evaluation Platform -Lester Kalms

As processors in computer systems take on more tasks of growing complexity, it makes sense to offload some of these tasks to relieve the processor. Some of these tasks are handled by expansion cards such as graphics cards or sound cards. Nowadays these cards communicate with the rest of the system via PCI-Express. Peripheral cards can of course also support other tasks. A platform for developing an expansion card is provided, for example, by Xilinx with the XUPV5 evaluation platform [5]. In order to communicate with the card, an interface is needed on both the hardware and the software side. This can be developed with this card and the Xilinx tools. In times of ever-increasing amounts of data, a correspondingly high data throughput is needed, which is feasible in theory with PCI-Express. This thesis deals with the development of an interface that communicates via PCI-Express and with how to maximize the data throughput. It is intended to support the work of others who want to develop an efficient expansion card. The second chapter deals with PCI-Express and explains the fundamentals needed for a basic understanding: what PCI-Express is, how communication works and what data throughput can be achieved. PCI-Express communicates via packets called "Transaction Layer Packets". The third chapter deals with the system that has been developed. It describes how the hardware design works, as a whole and in detail, and how it has been implemented. It also describes how the software system is created, how it works, and especially how the two systems interact. The following chapters cover the practical work. The fourth chapter describes how a running system that satisfies the requirements has been created. The system described in the previous chapter was able to communicate, but errors still occurred in various situations. The chapter explains what has been done to correct these errors, what did not work, and why. The fifth chapter deals with increasing the data throughput and includes some measurements. For easier handling and measurement, a user application was implemented. In the final chapter the results are discussed, interpreted and compared with the theory. Finally, there is an outlook on methods that can still be tested or that have not yet been tested completely.

## 8-11 April 2013: a 4K H.265/HEVC real-time decoder at NABShow 2013 in Las Vegas

A 4K H.265/HEVC real-time decoder was presented at the NABShow in Las Vegas, Nevada, USA, on April 8-11, 2013. The demo consisted of a software-based decoder running on a multicore PC connected to an 84-inch 4K TV. It was presented at the Fraunhofer HHI Booth C7843. The real-time decoder has been developed as part of a collaboration between the Fraunhofer Heinrich Hertz Institute (HHI) and the AES group of TU-Berlin. The demo was presented by Benjamin Bross from Fraunhofer HHI and Mauricio Alvarez-Mesa from Fraunhofer HHI and TU-Berlin.

## 14/03/2013, 10h, EN 642: "Migen - a Python toolbox for building complex digital hardware"-Sébastien Bourdeauducq

Despite being faster than schematics entry, hardware design with Verilog and VHDL remains tedious and inefficient for several reasons. The event-driven model introduces issues and manual coding that are unnecessary for synchronous circuits, which represent the lion's share of today's logic designs. Counter-intuitive arithmetic rules result in steeper learning curves and provide a fertile ground for subtle bugs in designs. Finally, support for procedural generation of logic (metaprogramming) through "generate" statements is very limited and restricts the ways code can be made generic, reused and organized.
To address those issues, we have developed the Migen FHDL library that replaces the event-driven paradigm with the notions of combinatorial and synchronous statements, has arithmetic rules that make integers always behave like mathematical integers, and most importantly allows the design's logic to be constructed by a Python program. This last point enables hardware designers to take advantage of the richness of the Python language - object oriented programming, function parameters, generators, operator overloading, libraries, etc. - to build well organized, reusable and elegant designs.
Other Migen libraries are built on FHDL and provide various tools such as a system-on-chip interconnect infrastructure, a dataflow programming system, a more traditional high-level synthesizer that compiles Python routines into state machines with datapaths, and a simulator that allows test benches to be written in Python.
URL:  http://milkymist.org/3/migen.html

## 02/01/2013 - 11 a.m.: Master school - ICT Innovation - information event

Invitation to the information event

On February 1st, 2013 at 11 a.m. an information event for the "European Dual Degree Master in ICT innovation" will take place in room TEL AB (Telefunken tower). We would like to invite all interested students, especially those who will finish their BSc degree by August 2013.

The dual degree master program "ICT Innovation" will start in the winter term 2013/14. The application deadline is April 15th, 2013.

## 31.01.2013, 10h, EN 642: "Composing Execution Times on Multicore Processors" - J. Reinier van Kampenhout

The use of multicore processors in embedded systems promises to reduce the space, weight and power requirements while offering increased functionality. To enable these benefits however, a runtime environment must be able to execute multiple safety-critical applications in parallel with non-critical applications. An underlying problem in multicores is the use of shared resources, which leads to interference between applications and unpredictable timing behaviour which is not acceptable for critical applications with hard real-time requirements.

In this research we will conceive and implement a concept for the execution of real-time applications on multicore processors with a composable timing behaviour. In our approach we decompose applications into basic blocks whose behaviour is deterministic and can be determined empirically. Using models that capture the essential properties of the HW and SW, we construct a deployment scheme out of these blocks. The result is a system on which multiple mixed-criticality applications are executed in parallel, each of which has a timing behaviour that is composed of that of its basic blocks. Thus our method guarantees isolation between applications and simplifies worst-case execution time analysis, independent of the hypervisor or OS. The usage of resources can furthermore be optimized by allocating any unused resources dynamically to non-critical applications at run time. We will prove the effectiveness of our concept by comparing the variation in execution times to those achieved with purely static scheduling, fixed-priority scheduling and virtualization.

## 21.01.2013: Mr. Tamer Dallou has won a “Best Poster Award” at the HiPEAC 2013.

Mr. Tamer Dallou has won a “Best Poster Award” for our joint poster “Nexus++: A hardware Task Manager for the StarSs Programming Model” at the 8th International Conference on High-Performance and Embedded Architectures and Compilers, HiPEAC 2013, January 2013, Berlin, Germany.

Abstract:
Recently, several programming models have been proposed that try to ease parallel programming. One of these programming models is StarSs. In StarSs, the programmer has to identify pieces of code that can be executed as tasks, as well as their inputs and outputs. Thereafter, the runtime system (RTS) determines the dependencies between tasks and schedules ready tasks onto worker cores. Previous work has shown, however, that the StarSs RTS may constitute a bottleneck that limits the scalability of the system, and proposed a hardware task manager called Nexus to eliminate this bottleneck. Nexus has several limitations, however. For example, the number of inputs and outputs of each task is limited to a fixed constant, and Nexus does not support double buffering. Here we present Nexus++, which addresses these as well as other limitations. Experimental results show that double buffering achieves a speedup of $54\times$, and that Nexus++ significantly enhances the scalability of applications parallelized using StarSs.
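How a runtime can derive task dependencies from declared inputs and outputs is sketched below. This is a software illustration only: it is not the Nexus++ hardware design or the actual StarSs runtime, it conservatively treats read-after-write, write-after-read and write-after-write as ordering constraints, and the task representation and names are hypothetical.

```python
def resolve_dependencies(tasks):
    """Derive predecessor sets for tasks submitted in program order.

    tasks -- list of (inputs, outputs) pairs, each a list of data names
    Returns a list where entry i is the set of task ids task i must wait for.
    """
    last_writer = {}  # data name -> id of the most recent task writing it
    readers = {}      # data name -> ids of tasks that read it since that write
    deps = []
    for tid, (inputs, outputs) in enumerate(tasks):
        waits_for = set()
        for name in inputs:
            if name in last_writer:
                waits_for.add(last_writer[name])      # read after write
        for name in outputs:
            if name in last_writer:
                waits_for.add(last_writer[name])      # write after write
            waits_for.update(readers.get(name, ()))   # write after read
        for name in inputs:
            readers.setdefault(name, []).append(tid)
        for name in outputs:
            last_writer[name] = tid
            readers[name] = []                        # new version, no readers yet
        deps.append(waits_for)
    return deps


tasks = [
    (["A"], ["B"]),       # task 0: B = f(A)
    (["A"], ["C"]),       # task 1: C = g(A)
    (["B", "C"], ["D"]),  # task 2: D = h(B, C)
    ([], ["A"]),          # task 3: overwrites A
]
deps = resolve_dependencies(tasks)
```

Tasks 0 and 1 have no predecessors and can run in parallel, task 2 waits for both of them, and task 3 has to wait until the earlier readers of `A` are done. A task is ready as soon as every task in its predecessor set has finished.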

## Jan. 2013 : HiPEAC 2013 in Berlin.

The HiPEAC conference will be held in Berlin from Monday, January 21 to Wednesday, January 23, 2013. The HiPEAC conference is the premier forum for experts in computer architecture, programming models, compilers and operating systems for embedded and general-purpose systems in Europe. In 2013 the general chairs will be Ben Juurlink of TU Berlin and Keshav Pingali of the University of Texas, Austin. Program chairs are André Seznec of INRIA Rennes and Lawrence Rauchwerger of Texas A&M University. Paper selection is performed by the ACM journal TACO. More than 500 people attended the HiPEAC 2012 conference in Paris. Hopefully HiPEAC 2013 will be as successful. For more information, stay tuned at http://www.hipeac.net/conference/berlin.

## 2012: Book "Scalable Parallel Programming Applied to H.264/AVC Decoding"

The book titled "Scalable Parallel Programming Applied to H.264/AVC Decoding", co-authored by Ben Juurlink, Mauricio Alvarez-Mesa, Chi Ching Chi, Arnaldo Azevedo, Cor Meenderinck and Alex Ramirez, has been published by Springer as part of the series SpringerBriefs in Computer Science. The book can be purchased from several internet retailers. More information can be found on the Springer webpage: http://www.springer.com/engineering/signals/book/978-1-4614-2229-7

## Nov. 12, 2012: "SynZEN: A Hybrid TTA/VLIW Architecture with a Distributed Register File" at NORCHIP 2012

The paper "SynZEN: A Hybrid TTA/VLIW Architecture with a Distributed Register File" by S. Hauser, N. Moser and B. Juurlink has been accepted at NORCHIP - The Nordic Microelectronics event 2012, which will be held in Copenhagen, Denmark, on Nov. 12 - Nov. 13 2012. More information about NORCHIP 2012 can be found at www.norchip.org.

## Oct 23rd, 11h, KIT: High Efficiency Video Coding on Multi- and Many-core Architectures by M. Alvarez Mesa

Dr. Mauricio Alvarez-Mesa will give an invited talk at Karlsruhe Institute of Technology titled "High Efficiency Video Coding on Multi- and Many-core Architectures", in which he will present the latest results of the AES research on HEVC decoding on parallel architectures. The talk will be held on October 23rd at 11:00 am at Karlsruher Institut für Technologie (KIT), Institut für Prozessdatenverarbeitung und Elektronik (IPE), Karlsruhe, Germany.

## 11. Oct 2012- 10h-EN 642: Scalable Runtime and OS Abstractions for Mesh-Based MultiCores (Prof. Frank Mueller)

Current trends in microprocessors are to steadily increase the number of cores. As the core count increases, the network-on-chip (NoC) topology has changed from buses over rings and fully connected meshes to 2D meshes.

This work contributes NoCMsg, a low-level message passing abstraction over NoCs. NoCMsg is specifically designed for large core counts in 2D meshes. Its design ensures deadlock free messaging for wormhole Manhattan-path routing over the NoC. Experimental results on the TilePro hardware platform show that NoCMsg can significantly reduce communication times when compared with other NoC-based message approaches. They further demonstrate the potential of NoC messaging to outperform shared memory abstractions, such as OpenMP, as core counts and inter-process communication increase.

This work further explores the benefits of novel runtime and operating systems abstractions for large scale multicores. On top of NoCMsg, a distributed OS abstraction is promoted instead of the traditional shared memory view on a chip. This distributed kernel features a pico-kernel per core. Sets of pico-kernels are controlled by micro-kernels, which are topologically centered within a set of cores. Cooperatively, micro-kernels comprise the overall operating system in a peer-to-peer fashion.

Biography: Frank Mueller is a Professor in Computer Science and a member of multiple research centers at North Carolina State University. Previously, he held positions at Lawrence Livermore National Laboratory and Humboldt University Berlin, Germany. He received his Ph.D. from Florida State University in 1994. He has published papers in the areas of parallel and distributed systems, embedded and real-time systems and compilers. He is a member of ACM SIGPLAN, ACM SIGBED and a senior member of the ACM and IEEE Computer Societies as well as an ACM Distinguished Scientist. He is a recipient of an NSF Career Award, an IBM Faculty Award, a Google Research Award and a Fellowship from the Humboldt Foundation.

## 11.Oct 2012: Courses in WS2012/13

The course information for the current semester is online. We would particularly like to point out the new course AES for bachelor students.

## Sept 30 - Oct 3, 2012: "Improving the Parallelization Efficiency of HEVC Decoding" at ICIP 2012

The paper "Improving the Parallelization Efficiency of HEVC Decoding" by C. C. Chi, M. Alvarez-Mesa, B. Juurlink, V. George and T. Schierl has been accepted at the 2012 IEEE International Conference on Image Processing (ICIP), which will be held in Orlando, Florida, USA, on Sept. 30 - Oct. 3, 2012. This paper is the second to result from a collaboration between the AES group and the Multimedia Communications Group of the Fraunhofer HHI Institute on the topic of parallel processing for HEVC. More information about ICIP-2012 can be found at http://icip2012.com.

## Sept 10, 2012: "Hardware-Based Task Dependency Resolution for the StarSs Programming Model" at SRMPDS'12

The paper "Hardware-Based Task Dependency Resolution for the StarSs Programming Model" by Tamer Dallou and Ben Juurlink has been accepted at the "SRMPDS'12 - Eighth International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems", which will be held in conjunction with "ICPP'12 - The 2012 International Conference on Parallel Processing" in Pittsburgh, PA on September 10, 2012.
This paper is a result of the research conducted at AES as part of the ENCORE project. More information on SRMPDS can be found at:
http://www.mcs.anl.gov/~kettimut/srmpds/

## Sept 5-8, 2012: "A Novel Predictor-based Power-Saving Policy for DRAM Memories" at the 15th EUROMICRO Conference on Digital System Design (DSD)

The paper "A Novel Predictor-based Power-Saving Policy for DRAM Memories" by Gervin Thomas, Karthik Chandrasekar, Benny Akesson, Ben Juurlink and Kees Goossens has been accepted at the 15th EUROMICRO Conference on Digital System Design (DSD), Cesme, Izmir, Turkey on September 5th - September 8th, 2012. This paper is a collaboration between the AES group (TU-Berlin) and Electronic Systems group (TU Eindhoven). More information about DSD-2012 can be found at http://www.univ-valenciennes.fr/congres/dsd2012/.

## August 27, 2012: "An Optimized Parallel IDCT on Graphics Processing Units" at HeteroPar'2012

The paper "An Optimized Parallel IDCT on Graphics Processing Units" by Biao Wang, Mauricio Alvarez-Mesa, Chi Ching Chi, and Ben Juurlink has been accepted at the 2012 International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar'2012), which will be held in Rhodes Island, Greece on August 27, 2012. The paper presents work on offloading the H.264 IDCT kernel to GPUs, which was conducted at AES as part of the LPGPU project. More information on HeteroPar can be found at http://pm.bsc.es/heteropar12/.

## July 31, 2012: AES has set up a testbed to accurately measure GPU power consumption.

AES has set up a testbed to accurately measure GPU power consumption. This testbed is being used to evaluate power reduction techniques on available GPUs. It will also be used to validate the power modeling of GPUSimPow, the GPU power simulator developed within the LPGPU project. Its high bandwidth and high sampling speed enable it to accurately measure short, sub-ms power events. The measurement software developed at AES allows developers to pinpoint power consumption down to the individual kernel.

## 16-19 July 2012: "Using OpenMP Superscalar for Parallelization of Embedded and Consumer Applications" at SAMOS XII

The paper "Using OpenMP Superscalar for Parallelization of Embedded and Consumer Applications" by M. Andersch, C.C. Chi and Ben Juurlink has been accepted at the 2012 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), which will be held in Samos, Greece on July 16-19, 2012. The paper presents the latest research on the OpenMP Superscalar programming model, which has been conducted at AES as part of the ENCORE project. More information on SAMOS can be found at http://samos.et.tudelft.nl/samos_xii/html/.

## July 11, 2012: "Nexus++: A hardware Task Manager for the StarSs Programming Model" at ACACES'12

The poster "Nexus++: A hardware Task Manager for the StarSs Programming Model" by Tamer Dallou and Ben Juurlink has been presented at the "ACACES'12 - Eighth International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems", which was held in Fiuggi, Italy, on 8-14 July, 2012.
This poster presents some of the results of the research conducted at AES as part of the ENCORE project. More information on ACACES'12 can be found at:
http://www.hipeac.net/summerschool/

## 8-14 July 2012: Mr. Tamer Dallou attends ACACES 2012.

Mr. Tamer Dallou was awarded a HiPEAC grant to attend the Eighth International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES 2012), 8-14 July, 2012, Fiuggi, Italy.

The "HiPEAC Summer School" is a one-week summer school for computer architects and compiler builders working in the field of high performance computer architecture and compilation for embedded systems. The school aims at the dissemination of advanced scientific knowledge and the promotion of international contacts among scientists from academia and industry.

## AES group purchases TILE-Gx36 many-core.

The AES group of TU Berlin has purchased a state-of-the-art TILE-Gx36 many-core with 36 64-bit processor cores (tiles) from Tilera (tilera.com). Soon researchers and students of AES will be able to work on this many-core processor. For more information about the TILE-Gx processor family, see http://tilera.com/products/processors/TILE-Gx_Family.

## May 2012: Article in ACM SIGARCH Computer Architecture News.

Ben Juurlink and his PhD graduate Cor Meenderinck have published an article entitled "Amdahl's Law for Predicting the Future of Multicores Considered Harmful" in the current (May 2012) issue of Computer Architecture News, which is published by the ACM Special Interest Group on Computer Architecture (SIGARCH) [http://www.sigarch.org/]. In the article they consider how the predictions in the influential paper of Hill and Marty [1] change when Gustafson's law is assumed instead of Amdahl's law. They also propose a different scaling equation, called the Generalized Scaled Speedup Equation (GSSE), that encompasses Amdahl's as well as Gustafson's law.

[1] Mark D. Hill, Michael R. Marty: Amdahl's Law in the Multicore Era. IEEE Computer 41(7): 33-38 (2008)
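
To make the contrast between the two laws concrete, the sketch below evaluates their textbook forms for a parallel fraction `f` and `n` cores. The function names are assumptions made for this example, and the GSSE itself is not reproduced here because its exact form is given only in the article.

```python
def amdahl_speedup(f, n):
    """Amdahl's law: fixed problem size, f is the parallelizable fraction."""
    return 1.0 / ((1.0 - f) + f / n)


def gustafson_speedup(f, n):
    """Gustafson's law: problem size scales with n, f is the parallel
    fraction of the execution time on the parallel machine."""
    return (1.0 - f) + f * n


if __name__ == "__main__":
    for cores in (4, 16, 64):
        print(cores, amdahl_speedup(0.9, cores), gustafson_speedup(0.9, cores))
```

With `f = 0.9` and 16 cores, the Amdahl speedup is about 6.4 while the Gustafson speedup is about 14.5, which shows how strongly predictions depend on the assumed law.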

## 16 May 2012: Prof. Dr. Ben Juurlink at Map2MPSoC/SCOPES.

Prof. Ben Juurlink will give an invited keynote at the 5th Workshop on Mapping of Applications to MPSoCs and 15th International Workshop on Software and Compilers for Embedded Systems, which will be held May 15-16 in the beautiful Schloss Rheinfels hotel at St. Goar, Germany (http://www.scopesconf.org/scopes-12/).

## 10 May 2012, 10h - Room EN 642: REFLEX (Richard Weickelt)

REFLEX is a framework for deeply embedded control systems. It is based upon the event-flow model, which greatly supports component-centric development of concurrent applications. In combination with multiple scheduling directives, interrupt handling and power management facilities, developers can create applications that are both deadlock-free and fully predictable.

The library is implemented in C++ and benefits from its powerful language features. Only a few parts are platform-dependent, and these can be ported to new architectures with very little effort. A standard compiler like g++ is the only requirement.

REFLEX was developed at the TU Cottbus and is released under the BSD license. In this meeting you will get a brief overview of the framework and its features. After a case study about a real-world product, future research challenges will be discussed.

## 12.04.2012 - 10h: Online satellite image processing (Kristian Manthey)

Kristian Manthey will give a talk on 12.04.2012 at 10:00 as part of our research meeting, on the topic: Online satellite image processing (Realtime Image compression on reconfigurable Hardware). Room: EN 642.

Abstract: There are challenging requirements on optical systems in spaceborne missions. In recent years, the spatial as well as the spectral resolution of the image data has increased, resulting in a tremendous increase in data rate. There are also requirements on image quality and constraints resulting from the environment in which the system is to be used. An optical system for spaceborne applications must have very high reliability, low power consumption and low weight. The system must be radiation tolerant and able to operate in vacuum and over a wide temperature range. With a decrease of the ground sample distance (GSD) or an increase of swath, the amount of data increases significantly. Due to the limited transmission bandwidth to the ground station, it is necessary to compress the data. Depending on the requirements of the mission, lossless or lossy compression schemes can be used.

Image compression itself is based on the removal of redundant information in the image, such as spatial or statistical redundancy, or on the removal of information not needed in further processing. Image compression architectures consist of spatial decorrelation to remove spatial redundancy, followed by quantization in the case of lossy compression, and finally entropy coding to remove statistical redundancy. Spatial decorrelation in typical space missions is done by prediction (DPCM), discrete cosine transform (DCT) or discrete wavelet transform (DWT). To achieve the best compression results, inter-band decorrelation techniques are necessary. This is because image data is correlated between bands, including when multispectral (MS) sensors are used in combination with a sensor that is sensitive in all MS channels.

At DLR, it is planned to develop a satellite camera which does all tasks - image acquisition, pre-processing, compression, storage, data formatting and communication with the ground station - on a single multi-chip-module (MCM). In a first step, the image compression should be done directly on the image acquisition module. The goal of this thesis is to investigate scenarios where the ground station interactively requests and decompresses the image data, and to develop a high-speed image compression system on the image acquisition module.
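
The prediction-based (DPCM) decorrelation step mentioned in the abstract can be sketched as follows. This is a minimal lossless illustration that assumes a previous-sample predictor on a single image row; the function names are hypothetical, and quantization and entropy coding are deliberately left out. It is not the system described in the talk.

```python
def dpcm_encode(row):
    """Replace each sample by its difference from the previous sample.

    The predictor (previous sample, starting from 0) is an assumption
    made for this sketch.
    """
    residuals = []
    prev = 0
    for sample in row:
        residuals.append(sample - prev)
        prev = sample
    return residuals


def dpcm_decode(residuals):
    """Invert dpcm_encode by accumulating the residuals."""
    samples = []
    prev = 0
    for residual in residuals:
        prev += residual
        samples.append(prev)
    return samples


if __name__ == "__main__":
    row = [100, 102, 103, 103, 101]
    encoded = dpcm_encode(row)
    print(encoded)
    print(dpcm_decode(encoded) == row)
```

For smooth image rows the residuals cluster around zero, which is what a subsequent entropy coder can exploit.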

## 26-27 March 2012: Prof. Dr. Ben Juurlink and Sean Halle present their progress in the LPGPU project in Cambridge

Prof. Dr. Ben Juurlink and Sean Halle are going to Cambridge for the first LPGPU face-to-face meeting, on March 26 and 27. They will discuss interactions between the work-packages, the low-power industry-space, and tackle simulator questions. Each participant is going to present their progress in the first year of LPGPU in preparation for the first-year review.

## 25-29 Feb. 2012 : Michael Andersch presents a poster at PPoPP in New Orleans.

The paper "Programming Parallel Embedded and Consumer Applications in OpenMP Superscalar" by Michael Andersch, Chi Ching Chi, and Ben Juurlink was accepted as a poster presentation at the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). The student Michael Andersch will present the poster in New Orleans from February 25 to February 29, 2012. For more information about the PPoPP conference, see http://dynopt.org/ppopp-2012/.

## 1.01.12 - 15.02.12: EIT ICT Master School launches

The application period for the new Master School runs from 1 January to 15 February 2012. Further information is available at eitictlabs.masterschool.eu

## 10.01.2011 - 16h: A System-Level Approach to Parallelism (Sean Halle)

Talk announcement: A System-Level Approach to Parallelism (Sean Halle).
Tuesday, 11 January 2011, at 16:00 in E-N 360.

## 31.03.2010: Course offerings in SS 2010

The courses offered by our group can be found in the Studium und Lehre (Studies and Teaching) section. We would particularly like to point out the master's module "Advanced Computer Architectures" for computer science and computer engineering students, which is being offered for the first time this semester.

## 8.03.10 - 10h: New architectures for the final scaling of the CMOS world (Professor Luigi Carro)

Talk announcement: New architectures for the final scaling of the CMOS world (Professor Luigi Carro). Monday, 08.03.2010, at 10:00 in FR5516.

## 17.02.2010 - 10h: Evaluation of Parallel H.264 Decoding Strategies on the Cell Broadband Engine (Mr. Chi Ching Chi)

Talk announcement: Evaluation of Parallel H.264 Decoding Strategies on the Cell Broadband Engine (Mr. Chi Ching Chi). Wednesday, 17.02.2010, at 10:00 in FR 3043.

## 12.01.2010: Oral examination in TechGI2 (2nd resit examination)

From SS 2010 onwards, the module Technische Grundlagen der Informatik 2 (TechGI2) will be taken over by the new head of the Embedded Systems Architecture (AES) group, Prof. Juurlink. He will make some changes to how the content specified in the module description is delivered, and these changes will also be reflected in the exam questions.
The previous lecturer of the module, Mr. Flik, loses his authorization to examine at the end of WS 2009/10, after which oral examinations on the previous content will no longer be possible.
For those currently interested in such an oral examination, Mr. Flik is offering examination dates until mid-March 2010. The examination days will be set once the first requests have been received (flik(at)cs.tu-berlin.de). Requests should state the degree programme, the matriculation number and the earliest preferred date.
The actual examination date will only be assigned once the examination registration required by the examination office has been submitted. This registration must be received at least 7 days before the examination date (at the AES or RT secretariat).

## 27.11.2009: Professor Dr. Ben Juurlink accepts appointment.

Professor Dr. Ben Juurlink, professor at Delft University of Technology, the Netherlands, has accepted the appointment to the W3 professorship for the field Computer Architecture - Embedded Systems Architecture (Rechnerarchitektur - Architektur eingebetteter Systeme).