# News

## Wanted: Student assistant with 41 hours per month and teaching duties.

Reference number: 3434 T 25/14
Application deadline: 10.03.2014
Period of employment: expected from 01.04.2014 to 31.03.2016

Responsibilities:
Assisting with teaching in the Bachelor's programme. Supervision and preparation of the following courses: TechGI1: Digitale Systeme, TechGI2/TechGI2TI: Rechnerorganisation, Hardware Praktikum

Requirements:
Bachelor's studies in Technische Informatik or Informatik, completion of the 3rd semester and passed modules TechGI1 and TechGI2 or equivalent qualifications, good knowledge of English and VHDL, willingness to become familiar with new subject areas

Please send your written application with CV, certificate of enrolment and, if applicable, a current transcript of records to:

Technische Universität Berlin
Fakultät IV - Elektrotechnik und Informatik
Institut für Technische Informatik und Mikroelektronik (TIME)
Fachgebiet Architektur eingebetteter Systeme (AES)
Sekretariat EN-12
Einsteinufer 17
10587 Berlin

or by e-mail:

## Feb. 28th 2014: Paper accepted for "Informatiktage 2014" by German Informatics Society.

Philipp Habermann's paper "Design and Implementation of a High-Throughput CABAC Hardware Accelerator for the HEVC Decoder" was accepted for the proceedings of the "Informatiktage 2014" conference by GI (German Informatics Society), which will be held on March 27-28, 2014 in Potsdam, Germany.

## Feb. 27th 2014: Paper accepted by German Informatics Society.

The paper "A High-Performance Hardware Accelerator for HEVC Motion Compensation" by Matthias Göbel has been accepted for the proceedings of the "Informatiktage 2014" conference by GI (German Informatics Society), which will be held on March 27-28, 2014 in Potsdam, Germany.

## Feb 27th, 2014: Polyphone Gitarrenakkorderkennung - Martin Schürer.

This bachelor thesis deals with a branch of automatic music transcription: the recognition of polyphonic guitar chords. Since transcription is very time-consuming, an automated solution eases the exchange of songs and melodies between musicians and supports teaching. This work presents an algorithm that analyses the polyphonic tones contained in a single-track guitar signal in order to determine the accompanying chords. The algorithm was implemented in MATLAB and is based on frequency analysis, so that a chord contained in a signal can be identified by spectral analysis. The analyzed note can be an individually played string or a chord with up to six simultaneously played strings. Furthermore, an onset detection was implemented, which recognizes the beginning and the end of a note. The algorithm is verified using synthetically generated and self-recorded chords. The aim of the algorithm is the correct detection of individually played strings, chords of minor complexity (two to three simultaneously played strings) and more complex chords (up to six simultaneously played strings). Based on the test cases used, the accuracy of note recognition is approximately 94 percent. The proper detection of a single strummed string requires an average time of 104 ms. For simple chords, the algorithm requires 202 ms and for complex chords 244 ms on average. The onset detection is verified with a test riff and achieves a recognition rate of 93 percent.
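The idea of identifying played strings by spectral analysis can be illustrated with a minimal Python sketch. It is not the thesis' MATLAB implementation: the Goertzel algorithm, the synthetic sine input, the 8 kHz sample rate and the relative threshold are all assumptions chosen for illustration, and the harmonics of a real guitar signal are ignored.

```python
import math

# Fundamental frequencies (Hz) of the open strings in standard tuning.
OPEN_STRINGS = {
    "E2": 82.41, "A2": 110.00, "D3": 146.83,
    "G3": 196.00, "B3": 246.94, "E4": 329.63,
}

def synth(freqs, sample_rate=8000, duration=0.25):
    """Sum of unit-amplitude sines, a stand-in for a recorded guitar signal."""
    n = int(sample_rate * duration)
    return [sum(math.sin(2 * math.pi * f * t / sample_rate) for f in freqs)
            for t in range(n)]

def goertzel_power(samples, freq, sample_rate=8000):
    """Spectral power of `samples` at one frequency (Goertzel algorithm)."""
    coeff = 2 * math.cos(2 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s_prev, s_prev2 = x + coeff * s_prev - s_prev2, s_prev
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

def detect_strings(samples, sample_rate=8000, rel_threshold=0.25):
    """Names of all candidate strings whose power is close to the strongest one."""
    powers = {name: goertzel_power(samples, f, sample_rate)
              for name, f in OPEN_STRINGS.items()}
    peak = max(powers.values())
    return sorted(name for name, p in powers.items() if p >= rel_threshold * peak)

chord = synth([82.41, 196.00, 329.63])   # three strings played together
print(detect_strings(chord))
```

Measuring power only at the six candidate frequencies keeps the sketch short; a full spectrum (FFT) would be needed to recognise fretted notes.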

## Feb. 10th, 2014: Paper accepted at GLSVLSI'14.

The paper "A Generic Implementation of a Quantified Predictor for FPGAs" by Gervin Thomas, Ahmed Elhossini and Ben Juurlink has been accepted for oral presentation at the 24th edition of GLSVLSI in Houston, Texas, USA, May 21-23. More information about GLSVLSI'14 can be found at http://www.glsvlsi.org/.

## Wanted: Student assistant with 41 hours per month.

Reference number: 3434 T 16/14
Application deadline: 24.02.2014
Period of employment: expected from 01.04.2014 to 31.03.2016

Responsibilities:
Assisting with teaching in the Bachelor's programme. Supervision and preparation of the following courses: TechGI1: Digitale Systeme, TechGI2/TechGI2TI: Rechnerorganisation, Hardware Praktikum

Requirements:
Bachelor's studies in Technische Informatik or Informatik, completion of the 3rd semester and passed modules TechGI1 and TechGI2 or equivalent qualifications, good knowledge of English and VHDL, willingness to become familiar with new subject areas

Please send your written application with CV, certificate of enrolment and, if applicable, a current transcript of records to:

Technische Universität Berlin
Fakultät IV - Elektrotechnik und Informatik
Institut für Technische Informatik und Mikroelektronik (TIME)
Fachgebiet Architektur eingebetteter Systeme (AES)
Sekretariat EN-12
Einsteinufer 17
10587 Berlin

E-mail:

## Feb 13th, 2014-10:30am-Interfacing an Embedded Vision System to Embedded Linux on a Zynq FPGA - Konstantinos Gavriilidis.

Real-time stereo vision systems are deployed in many computer engineering applications, from intelligent robotics to automated systems. Stereo vision is a compute-intensive task, as it requires computing the correlation between a pair of stereo frames. To achieve a correlation calculation fast enough to provide acceptable results in real time, an efficient system design is needed. In this thesis, a hardware-software co-design approach is presented, which makes use of the synergy of hardware and software. An existing real-time vision system was extended to calculate the depth of a recorded object. This vision system consists of a Zynq SoC which receives the stream of a VmodCAM module. The work conducted for this thesis can be divided into three parts. First, a Linux driver was implemented to enhance the software capabilities of the Zynq. Then, the hardware configuration of the Zynq was extended to support stereo streaming. Finally, a stereo depth calculation algorithm, which uses the driver to obtain a stereo stream in real time, was implemented in software. The proposed platform design and algorithm were evaluated in terms of accuracy and performance, and several enhancements are proposed to improve both.
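How correlating a stereo pair leads to depth can be sketched as follows. This is an illustrative simplification, not the algorithm of the thesis: it assumes rectified images, works on a single scanline, uses the sum of absolute differences (SAD) as the matching cost, and the focal length and baseline values are hypothetical.

```python
def disparity_at(left_row, right_row, x, window=2, max_disp=8):
    """Disparity of pixel x on one rectified scanline via SAD block matching."""
    best_d, best_cost = 0, float("inf")
    for d in range(max_disp + 1):
        if x - d - window < 0:          # candidate block would leave the image
            break
        cost = sum(abs(left_row[x + k] - right_row[x - d + k])
                   for k in range(-window, window + 1))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

def depth_from_disparity(d, focal_px, baseline_m):
    """Pinhole stereo relation: depth = focal length * baseline / disparity."""
    return focal_px * baseline_m / d if d > 0 else float("inf")

# A small feature seen at x = 9..11 in the left image and 3 pixels
# further left in the right image.
left = [0] * 20
right = [0] * 20
left[9:12] = [50, 200, 90]
right[6:9] = [50, 200, 90]

d = disparity_at(left, right, x=10)
print(d, depth_from_disparity(d, focal_px=600, baseline_m=0.5))
```

The inner cost loop runs for every pixel and every candidate disparity, which is why this step dominates the computation and is a natural target for hardware-software co-design.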

## Feb 13th, 2014-10:00am- Statisches Scheduling auf einem echtzeitbasiertem Multi-Core System- Mirko Liebender.

In this thesis we consider the possibilities for implementing a static scheduler on safety-critical multi-core real-time systems. Static schedulers are mainly used in domains like automotive and avionics. Among other things, they can be used to guarantee hard real-time behaviour on modern multi-core processors.

In the course of this thesis, a simple static scheduler for the operating system VxWorks 6.9 was implemented without changing the kernel itself. The static scheduler is therefore located one layer above the operating system's scheduler. The concept of the scheduler allows any correct static scheduling plan to be used without adjusting the code of the custom programs that it runs. In multiple experiments on a QorIQ P4080 machine, the functionality, correctness and possible overhead of the implemented static scheduler were measured and evaluated. Correct and punctual execution could be demonstrated, and a small overhead of the scheduling routine was detected. The reinitialization of each process after it has finished is, however, more or less unpredictable and disqualifies the static scheduler for safety-critical usage without further insight into the operating system code. In this thesis, a working static scheduler for VxWorks 6.9 could be realized with a few limitations concerning predictability.
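A table-driven static scheduler of the kind described above can be sketched in a few lines. The sketch runs in simulated time and uses no VxWorks API; the plan format `(offset, task, budget)`, the task names and the function names are assumptions made for illustration.

```python
def validate(plan, major_cycle):
    """Reject plans whose time slots overlap or exceed the major cycle."""
    end = 0
    for offset, name, budget in plan:
        if offset < end or offset + budget > major_cycle:
            raise ValueError(f"slot for {name!r} overlaps or overruns the cycle")
        end = offset + budget

def run(plan, major_cycle, cycles, tasks):
    """Dispatch tasks at their fixed offsets; returns a (time, task) trace."""
    validate(plan, major_cycle)
    trace = []
    for c in range(cycles):
        for offset, name, _budget in plan:
            tasks[name]()               # the task code itself is unchanged
            trace.append((c * major_cycle + offset, name))
    return trace

# Hypothetical plan: offsets and budgets in abstract time units.
PLAN = [(0, "sensor", 2), (2, "control", 3), (6, "logger", 1)]
TASKS = {name: (lambda: None) for _, name, _ in PLAN}

trace = run(PLAN, major_cycle=10, cycles=2, tasks=TASKS)
print(trace)
```

Because the plan is plain data, a different schedule can be swapped in without touching the tasks, which mirrors the property the abstract highlights.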

## Feb 10th, 2014: Mauricio Álvarez-Mesa gives an invited talk at University of Castilla-La-Mancha, Albacete, Spain.

Dr. Mauricio Álvarez-Mesa will give an invited talk at the Department of Computer Science of the University of Castilla-La-Mancha in Albacete, Spain, on Monday, February 10th. The title of the talk is "Parallel Video Decoding: Experiences with H.264 and HEVC". In this talk, Dr. Álvarez-Mesa will present the latest results of the AES research group on video decoding using parallel architectures. Dr. Álvarez-Mesa will also meet researchers from the Computer Architecture and Technology group at the Albacete Research Institute of Informatics (I3A) to discuss possible joint projects.

## Wanted: Research assistant (m/f), pay grade 13 TV-L Berliner Hochschulen, for a maximum of 5 years (doctoral position)

Reference number: IV - 13/14 (position available from 01.03.2014 / application deadline 23.02.2014)

## Jan. 15th 2014. AES group at HiPEAC 2014 Conference.

A delegation of the AES group will participate in the HiPEAC 2014 conference which will be held in Vienna from January 20th to January 22nd 2014. AES participation includes three presentations at the LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014). The workshop is organized by members of the LPGPU European project, including the AES group from TUB. The presentations from the AES group are: "Power and Energy Efficiency of Video Decoding on Multi-core Architectures" by Chi Ching Chi, "DART: A Decoupled Architecture Exploiting Temporal SIMD" by Jan Lucas, and "Parallel H.264/AVC Motion Compensation for GPUs using OpenCL" by Biao Wang (who got a PhD student registration grant from the HiPEAC 2014 organizing committee). More information about the PEGPUM workshop can be found at lpgpu.org/wp/pegpum-2014/ and about the HiPEAC conference at www.hipeac.net/conference/vienna.

## Jan. 16, 2014-10:40am- Power and Energy Efficiency of Video Decoding on Multi-core Architectures- Chi Ching Chi.

In this talk we present how modern power states influence the power consumption of real-time HEVC video decoding. In real-time applications a set amount of operations has to be performed within a time frame, allowing the processor to go idle once the task has been completed. Processor architectures and off-chip memory have incorporated many low-power states, which allow the processor to consume less energy at lower activity levels. On x86 processors this has resulted primarily in so-called P-states and C-states, which control the power consumption when active and idle, respectively. Each of these states has a different transition time, power consumption, and performance level, introducing the new problem of choosing when to use which state. In research, conflicting strategies such as "race to idle" and running longer at a lower clock frequency have each been proposed as the best solution. An evaluation has been performed to find out which technique is better for HEVC decoding. The analysis has been performed on different systems ranging from desktop to ultra-mobile platforms.
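Why neither strategy wins in general can be seen from a simple energy model per frame period: energy = active power × busy time + idle power × remaining time. The sketch below uses made-up power and frequency values (they are not measurements from the talk) and ignores state transition times.

```python
def frame_energy(work_cycles, freq_hz, active_w, idle_w, period_s):
    """Energy (J) spent in one frame period under a two-state power model."""
    busy = work_cycles / freq_hz
    if busy > period_s:
        raise ValueError("deadline missed at this frequency")
    return active_w * busy + idle_w * (period_s - busy)

WORK = 20e6        # cycles needed to decode one frame (hypothetical)
PERIOD = 0.04      # 25 frames per second
IDLE_W = 0.5       # idle (C-state) power, hypothetical

# Race to idle: 2 GHz at 10 W, then idle for the rest of the period.
race = frame_energy(WORK, 2e9, 10.0, IDLE_W, PERIOD)

# Run slower: 1 GHz. The outcome depends on how much power the lower P-state saves.
slow_small_saving = frame_energy(WORK, 1e9, 8.0, IDLE_W, PERIOD)
slow_large_saving = frame_energy(WORK, 1e9, 4.0, IDLE_W, PERIOD)

print(race, slow_small_saving, slow_large_saving)
```

With these numbers, racing to idle beats the slow setting when the lower P-state only saves 20% of the active power, but loses when it saves 60%, so the better strategy has to be determined per platform.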

## Jan. 16, 2014-10:20am- DART: A Decoupled Architecture Exploiting Temporal SIMD- Jan Lucas.

GPUs can offer very high performance and good energy efficiency on some applications. Many applications, however, do not perform well. The high area and energy efficiency is reached by grouping threads into groups called warps and running the threads of one warp in lockstep. This way, with only one instruction per warp fetched, decoded and issued, up to warp-length operations can be executed. In conventional GPU implementations, spatial SIMD units are used to execute warps. This results in underutilization of the execution units if the threads of one warp follow different control flow paths (branch divergence). This talk presents DART, a novel architecture for GPUs based on temporal SIMD. By using a temporal implementation of SIMD, it can offer better utilization of the execution units under branch divergence. The details of the DART architecture will be explained, and benchmark results comparing DART and conventional GPUs will be presented.
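The utilization argument can be made concrete with a deliberately simplified counting model, which is an assumption of this sketch and not a description of DART's microarchitecture. A spatial SIMD unit occupies all lanes for every issued instruction, whereas a temporal SIMD lane processes the threads of a warp one after another and can skip inactive ones.

```python
def spatial_lane_cycles(active_per_instr, warp_size):
    """Every issued instruction occupies all lanes for one cycle."""
    return len(active_per_instr) * warp_size

def temporal_lane_cycles(active_per_instr):
    """One lane spends one cycle per active thread; inactive threads are skipped."""
    return sum(active_per_instr)

def utilization(active_per_instr, lane_cycles):
    """Fraction of lane-cycles that perform useful work."""
    return sum(active_per_instr) / lane_cycles

# An if/else that splits a 32-thread warp 24/8, one instruction per branch.
active = [24, 8]
spatial = spatial_lane_cycles(active, warp_size=32)
temporal = temporal_lane_cycles(active)

print(utilization(active, spatial), utilization(active, temporal))
```

In this model the divergent branch halves the utilization of the spatial unit, while the temporal lane stays fully utilized.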

## Jan. 16, 2014-10:00am- Parallel H.264/AVC Motion Compensation for GPUs using OpenCL- Biao Wang.

Motion Compensation (MC) is one of the most compute-intensive parts of H.264/AVC video decoding. It exposes massive parallelism that can benefit from Graphics Processing Units (GPUs). However, the divergence caused by different interpolation modes in MC can lead to a significant performance penalty on GPUs. In this work, we propose a novel multi-stage approach to parallelize the MC kernel for GPUs using OpenCL. The proposed approach mitigates the divergence by exploiting the fact that different interpolation modes share common computation stages. In addition, the optimized kernel has been integrated into an ffmpeg decoder that supports H.264/AVC high profile. We evaluated our kernel on GPUs with different architectures shipped by AMD, Intel, and Nvidia. Compared to a CPU implementation, our kernel achieves maximum speedups of 3.27 and 3.59 for 1080p and 2160p videos, respectively. Furthermore, we applied zero-copy optimization for integrated GPUs from AMD and Intel to eliminate memory copy overhead between CPU and GPU.
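The observation that interpolation modes share computation stages can be illustrated on a single row of luma samples. The sketch below is a one-dimensional simplification written for this page, not the OpenCL kernel from the paper: stage 1 computes the same intermediates for every mode using the H.264 six-tap half-sample filter, and stage 2 only selects operands from a table instead of branching per mode.

```python
TAPS = (1, -5, 20, 20, -5, 1)   # H.264 luma half-sample filter

def clip255(v):
    return max(0, min(255, v))

def half_sample(row, x):
    """Half-sample value between row[x] and row[x + 1]."""
    acc = sum(t * row[x - 2 + i] for i, t in enumerate(TAPS))
    return clip255((acc + 16) >> 5)

def interpolate(row, x, frac):
    """Sample at position x + frac/4 on one row, frac in 0..3."""
    # Stage 1: identical work for all modes.
    full_l, half, full_r = row[x], half_sample(row, x), row[x + 1]
    # Stage 2: table-driven operand selection, then a rounded average.
    a, b = {0: (full_l, full_l), 1: (full_l, half),
            2: (half, half), 3: (half, full_r)}[frac]
    return (a + b + 1) >> 1

row = list(range(0, 100, 10))   # a linear ramp 0, 10, ..., 90
print([interpolate(row, 3, f) for f in range(4)])
```

Computing the half-sample value even for the integer position wastes some work, but it keeps all threads on the same instruction stream, which is the trade-off that reduces divergence.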

## Jan. 7, 2014: Paper accepted at MULTIPROG.

The paper "Considering Quality-of-Service for Resource Reduction using OpenMP" has been accepted for presentation at the Seventh Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG-2014) to be held in conjunction with the 9th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) in Vienna, Austria on January 22, 2014. The paper is the result of collaboration between Artur Podobas, Mats Brorsson and Vladimir Vlassov from KTH in Stockholm, Sweden, and Chi Ching Chi and Ben Juurlink from the AES group of TU Berlin. More information about the workshop can be found at http://multiprog.ac.upc.edu.

## Jan. 2, 2014- AES publicizes their H264 OpenCL decoder.

To employ the power of GPUs for massively parallel processing, this work offloads parallel kernels in H.264 decoding, namely inverse transform and motion compensation, onto GPUs. At the kernel level, a significant speedup is observed compared to a highly optimized CPU SIMD implementation.


## Dec. 19, 2013- 10:30h, EN 642: Performance portability for low-power embedded GPUs- Guilherme Calandrini.

GPUs are already a common part of embedded system design, and the OpenCL standard promises a vendor-neutral programming solution for CPUs, GPUs and DSPs. These advances are driven by portable devices such as smartphones and tablets, which have stimulated the emergence of several distinct low-power embedded GPU architectures. This presentation shows the results of an analysis of the performance and power efficiency of several GPU architectures using a suite of OpenCL micro-benchmarks. The results are key to understanding the performance and power implications of optimization strategies for the different GPU architectures, and serve as a guide for selecting appropriate GPUs according to performance or energy efficiency requirements.

## Dec. 19, 2013- 10h, EN 642: Device Tree-Erweiterung für QEMU- Tim Barnewski.

I will show how to add configurable hardware support using Device Trees for the Versatile Platform Baseboard family of single-board computers to QEMU. To achieve this, I will first provide a more detailed insight into the structure and purpose of Device Trees, as well as some of the inner workings of QEMU that are relevant to this goal. Additionally, an overview of the implementation itself will be provided, and it will cover any surprises, problems or other findings encountered during the process.

## Dec. 17, 2013: Visit to Technische Universität Dresden.

The PhD guest Guilherme Calandrini visited TU Dresden to give a presentation entitled "Performance Portability and Energy Issues in Computing Architectures" to the Operating Systems and Security group of Prof. Dr. Hermann Härtig in the Faculty of Computer Science. The group has a strong background in high-level system development, such as the Linux kernel and virtualization. The visit aimed to present the issues in developing energy-efficient applications that must deal with different layers of the computing architecture (from the circuit level and architecture design to the operating system and even the virtual machine). During the visit, he also had the opportunity to introduce the LPGPU project to Prof. Dr. Emil Matus from the Vodafone Chair Mobile Communications Systems, who works on a heterogeneous SoC for communication applications.

For further information about the talk, see os.inf.tu-dresden.de/EZAG/abstracts/abstract_20131217.xml

## Dec 5, 2013- 10h, EN 642: Multiprotokollfähige Master für Ethernet-basierte Feldbusse - Victor Kozhukhov

Modern CNC controllers use special Ethernet-based protocols to control slave devices in real time. A CNC controller from Schleicher Electronic, which is already able to operate as a Sercos III master, is extended with the functionality of an EtherCAT master. Both protocols should be usable over the same Ethernet ports of the CNC controller. The user should be able to decide independently whether the CNC controller is to be used as a Sercos III master or as an EtherCAT master. Switching the protocol should require as little effort as possible. In particular, changes to the hardware (with the exception of the contents of the programmable logic) are to be avoided. The CNC controller uses a dual MAC optimized specifically for the Sercos III master. When the software starts, the dual MAC of the Sercos III master is loaded as an IP core onto an FPGA integrated in the CNC controller. To enable the CNC controller to be used as an EtherCAT master, a suitable dual MAC is developed. Thus, at software start-up it can be decided whether the dual MAC for the Sercos III master or the dual MAC for the EtherCAT master is loaded onto the FPGA. The dual PHY, which is integrated in the CNC controller and connected to the FPGA, is suitable for both protocols.

In addition, the software has to be adapted so that the CNC controller can operate as an EtherCAT master. An EtherCAT master high-level driver from Acontis Technologies is used to build the EtherCAT master protocol stack. This high-level driver is available as a precompiled library. The application-specific software can link the high-level driver and use it through a corresponding API. To use the high-level driver together with the dual MAC developed specifically for the EtherCAT master, an Ethernet hardware abstraction layer is implemented.

## Dec 3, 2013: Vice chancellor of Politecnico di Milano visits AES group.

On Tuesday, December 3, the vice chancellor of TU Berlin's partner university Politecnico di Milano, Prof. Donatella Sciuto, will visit the AES group. Prof. Sciuto is a full professor in Computer Engineering at the Dipartimento di Elettronica e Informazione of the Politecnico di Milano. She is Deputy Director of Education at CEFRIEL, where she manages the education and training programs for executives and companies. More information about Prof. Sciuto and her research interests can be found on her website.

## Dec 2, 2013, 13h: Crown Scheduling: Energy-Efficient Resource Allocation, Mapping and Discrete Frequency Scaling for Collections of Malleable Streaming Tasks- Prof. Dr. Christoph Kessler.

Time: 1:00 PM, 2 December 2013
Place: 4.064, MAR Building

Abstract:
We investigate the problem of generating energy-optimal code for a collection of streaming tasks that include parallelizable or malleable tasks on a generic manycore processor with dynamic discrete frequency scaling. Streaming task collections differ from classical task sets in that all tasks are running concurrently, so that cores typically run several tasks that are scheduled round-robin at user level in a data driven way. A stream of data flows through the tasks and intermediate results are forwarded on-chip to other tasks.
In this presentation we introduce Crown Scheduling, a novel technique for the combined optimization of resource allocation, mapping and discrete voltage/frequency scaling for malleable streaming task sets in order to optimize energy efficiency given a throughput constraint. We present optimal off-line algorithms for separate and integrated crown scheduling based on integer linear programming (ILP). Our energy model considers both static idle power and dynamic power consumption of the processor cores.
Our experimental evaluation of the ILP models for a generic manycore architecture shows that at least for small and medium sized task sets even the integrated variant of crown scheduling can be solved to optimality by a state-of-the-art ILP solver within a few seconds.
We conclude with a short outlook to the new EU FP7 project EXCESS (Execution Models for Energy-Efficient Computing Systems).
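To give a feeling for the frequency-scaling part of the problem, the following sketch picks a discrete frequency level per task so that all tasks on one core finish within a deadline at minimal energy. It is a strongly reduced illustration, not crown scheduling: there is no resource allocation or mapping, exhaustive search replaces the ILP formulation, and the energy model (dynamic power proportional to f³, no static power) is an assumption.

```python
from itertools import product

def best_frequencies(work, levels, deadline):
    """Return (energy, frequency assignment) with minimal energy, or None."""
    best = None
    for assign in product(levels, repeat=len(work)):
        time = sum(w / f for w, f in zip(work, assign))
        if time > deadline:
            continue                    # throughput constraint violated
        # Power ~ f**3 and time = w / f, so energy per task ~ w * f**2.
        energy = sum(w * f ** 2 for w, f in zip(work, assign))
        if best is None or energy < best[0]:
            best = (energy, assign)
    return best

# Two tasks with 4 and 2 units of work, two frequency levels, deadline 4.
print(best_frequencies(work=[4, 2], levels=(1, 2), deadline=4))
```

The search space grows exponentially with the number of tasks, which is why an ILP solver is the more realistic tool beyond toy sizes.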

Acknowledgements:
This is joint work with Nicolas Melot (Linköping University), Patrick Eitschberger and Jörg Keller (FernUniv. in Hagen, Germany). Partly funded by VR, SeRC, and CUGS.
Based on our recent paper with the same title at Int. Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS-2013), Sep. 2013, Karlsruhe, Germany.

Short Biography:
Christoph W. Kessler (German spelling: Keßler) is a professor of Computer Science at Linköping University, Sweden, where he leads the Programming Environment Laboratory's research group on compiler technology and parallel computing. Christoph Kessler received a PhD degree in Computer Science in 1994 from the University of Saarbrücken, Germany, and a Habilitation degree in 2001 from the University of Trier, Germany.
In 2001 he joined Linköping University, Sweden, as associate professor at the programming environments lab (PELAB) of the computer science department (IDA).
In 2007 he was appointed full professor at Linköping University. His research interests include parallel programming, compiler technology, code generation, optimization algorithms, and software composition. He has published two books, several book chapters and more than 90 scientific papers in international journals and conferences. His contributions include the OPTIMIST retargetable optimizing integrated code generator for VLIW and DSP processors, the PARAMAT approach to pattern-based automatic parallelization, the concept of performance-aware parallel components for optimized composition, the PEPPHER component model and composition tool for heterogeneous multicore/manycore based systems, the SkePU library of tunable generic components for GPU-based systems, and the parallel programming languages Fork and NestStep.

## 27-28 Nov. 2013: Ben Juurlink gives keynote presentation at ICT.OPEN 2013.

Ben Juurlink has been invited to give a keynote presentation in the Embedded Systems track of ICT.OPEN 2013. ICT.OPEN is the principal ICT research conference in the Netherlands and is held on 27-28 November in Eindhoven. The title of his talk is "Lessons Learnt From Parallelizing Video Decoding". More information about the conference can be found at www.ictopen2013.nl/content/speakers.

## 21.11.13, 10h, EN 642: Manycore Agent-Oriented Programming (MAOP)- Silvano Menk and Robert Hering

In our presentation we want to give a short overview of our bachelor thesis. Therefore we will briefly discuss the current state of parallel programming with special focus on manycore architectures. From this we will deduce our idea for a supposedly intuitive and efficient programming model for manycore architectures, which will be the subject of our thesis. Finally we will propose a coarse working plan and hope for some initial feedback and suggestions.

## November 18-19 2013. Fusing GPU Kernels at HiPEAC Compiler, Architecture and Tools Conference.

A presentation based on research undertaken by Codeplay and TU Berlin’s AES group as part of the LPGPU project will be given at this year’s HiPEAC Compiler, Architecture and Tools Conference in Haifa, Israel. The talk is titled “Fusing GPU kernels within a novel single-source C++ API” and will be presented by Paul Keir from Codeplay.

Abstract of the talk:

The prospect of GPU kernel fusion is often described in research papers as a standalone command-line tool. Such a tool adopts a usage pattern wherein a user isolates, or annotates, an ordered set of kernels. Given such OpenCL C kernels as input, the tool would output a single kernel, which performs similar calculations, hence minimising costly runtime intermediate load and store operations. Such a mode of operation is, however, a departure from normality for many developers, and is mainly of academic interest.

Automatic compiler-based kernel fusion could provide a vast improvement to the end-user's development experience. The OpenCL Host API, however, does not provide a means to specify opportunities for kernel fusion to the compiler. Ongoing and rapidly maturing compiler and runtime research, led by Codeplay within the LPGPU EU FP7 project, aims to provide a higher-level, single-source, industry-focused C++-based interface to OpenCL. Along with LPGPU's AES group from TU Berlin, we have now also investigated opportunities for kernel fusion within this new framework; utilising features from C++11 including lambda functions; variadic templates; and lazy evaluation using std::bind expressions.

While pixel-to-pixel transformations are interesting in this context, insomuch as they demonstrate the expressivity of this new single-source C++ framework, we also consider fusing transformations which utilise synchronisation within workgroups. Hence convolutions, utilising halos; and the use of the GPU's local shared memory are also explored.

A perennial problem has therefore been restructured to accommodate a modern C++-based expression of kernel fusion. Kernel fusion thus becomes an integrated component of an extended C++ compiler and runtime.
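The effect of fusing pixel-to-pixel kernels can be shown with a small language-neutral sketch. Python stands in here for the C++ API discussed in the talk, and the helper names `run_unfused`, `fuse` and `run_fused` are hypothetical. The point is only that fusion produces the same result while avoiding the intermediate buffer between kernels.

```python
def run_unfused(kernels, data):
    """Apply each kernel in its own pass; every pass stores a full buffer."""
    stores = 0
    for kernel in kernels:
        data = [kernel(x) for x in data]    # intermediate buffer
        stores += len(data)
    return data, stores

def fuse(kernels):
    """Compose element-wise kernels into a single kernel."""
    def fused(x):
        for kernel in kernels:
            x = kernel(x)
        return x
    return fused

def run_fused(kernels, data):
    """One pass over the data, one store per element."""
    fused = fuse(kernels)
    out = [fused(x) for x in data]
    return out, len(out)

brighten = lambda p: p * 2
offset = lambda p: p + 1
pixels = [1, 2, 3]

print(run_unfused([brighten, offset], pixels))
print(run_fused([brighten, offset], pixels))
```

Kernels that need synchronisation within a workgroup, such as convolutions with halos, cannot be composed this simply, which is why the abstract treats them as a separate case.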

## Nov. 17-22, 2013: Mr. Tamer Dallou is presenting a paper at MTAGS - SC 2013

Mr. Tamer Dallou is attending "The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2013)" to present a paper at the "6th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS 2013)". The paper is titled "FPGA-Based Prototype of Nexus++ Task Manager" and presents the recent VHDL design and evaluation of Nexus++, our hardware task graph manager for task-based programming models. SC 2013 is a principal HPC conference worldwide and takes place in Denver, CO, USA on Nov. 17-22, 2013.

## 7.11.13, 10:30h, EN 642: Automatic Code Generation for a Microblaze system with ARM NEON SIMD Acceleration - Ilias Timon Poulakis

SIMD (Single Instruction, Multiple Data) accelerators are increasingly deployed in modern CPU architectures. These units can efficiently process certain data, e.g. multimedia formats, improving CPU performance and energy consumption. The research department Embedded Systems Architecture (AES) of the Berlin Institute of Technology currently utilizes the Microblaze processor by XILINX, which does not support SIMD acceleration natively. Hence, an ARM NEON compatible SIMD accelerator has been attached to the Microblaze processor. The two units communicate through a protocol based on FSL (Fast Simplex Link). To use this unusual architecture efficiently, automatic code generation is needed. Yet, creating a custom compiler is difficult and very time-consuming. To avoid this route, this thesis presents an alternative approach in which only existing compiler backends are used. The main idea is to create machine code for Microblaze and ARM NEON separately, using their respective existing compiler backends. Code sections executable by ARM NEON have to be located and then appropriately inserted into the Microblaze code. In the course of this thesis, a tool that performs these tasks has been successfully implemented, tested and evaluated. This work focuses on the realization steps taken. The capabilities of the implemented tool are discussed, and an outlook is given on how the approach could be utilized for a different combination of processor and SIMD accelerator.

## 7.11.13, 10h, EN 642: Instruction compression for the synZEN architecture- Tammo Johannes Herbert

This thesis focuses on the compression of the synZEN architecture's instructions. The synZEN architecture is a parallel architecture following the MIMD (multiple instruction, multiple data) principle. Parallelism is achieved at the instruction level (ILP, instruction-level parallelism) by means of the VLIW (Very Long Instruction Word) concept. A VLIW may contain NOPs (No Operation). The compression method presented here exploits the possibility of removing those NOPs, so that they are not stored repeatedly in the instruction memory. Removable NOPs are stored once in a central place. By taking advantage of this redundancy, NOP instructions in the VLIWs can be saved, and the average size of the VLIWs can be reduced. This requires two substeps. First, the opcode needs to be compressed at compile time. This step is implemented purely in software. Second, the hardware needs to be altered accordingly for decompression, which happens at runtime. Therefore, the resource utilization of the hardware has to be weighed against the degree of compression achieved in software. The concluding evaluation describes the balance required for the highest possible effectiveness.
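A common way to remove NOPs from VLIW bundles is to store a bitmask of occupied slots followed by the non-NOP operations only. The sketch below shows this mask-based variant as an illustration; it is an assumption of this example and may differ from the scheme developed in the thesis, which stores removable NOPs once in a central place.

```python
NOP = None   # placeholder for an empty issue slot

def compress(bundle):
    """Compile-time step: keep a slot bitmask and only the real operations."""
    mask, ops = 0, []
    for slot, op in enumerate(bundle):
        if op is not NOP:
            mask |= 1 << slot
            ops.append(op)
    return mask, ops

def decompress(mask, ops, width):
    """Runtime step (done in hardware on a real machine): re-insert the NOPs."""
    remaining = iter(ops)
    return [next(remaining) if (mask >> slot) & 1 else NOP
            for slot in range(width)]

bundle = ["add", NOP, NOP, "load"]       # a 4-slot VLIW with two NOPs
mask, ops = compress(bundle)
print(bin(mask), ops, decompress(mask, ops, width=4))
```

The mask costs one bit per slot in every bundle, so the saving depends on how many NOPs a typical bundle contains, and the decompression logic adds hardware cost on the fetch path.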

## 29–30 October 2013: Ben Juurlink @ Cyber-Physical Systems: Uplifting Europe's Innovation Capacity.

Ben Juurlink is currently visiting this two-day event in Brussels, which is devoted to exploring the innovation potential of Cyber-Physical Systems (CPS). The event is organized by the European Commission and discusses how EU Research and Innovation Programmes can stimulate the creation of new industrial platforms led by EU actors and facilitate the matchmaking between future user/customer needs and technology offers. For more information, see http://www.amiando.com/cps-conference.html.

## October 14, 2013: Prof. Juurlink is a member of the PhD defense committee of Yifan He.

Prof. Juurlink is visiting Eindhoven, NL, where he is a member of the PhD defense committee of Yifan He.

Yifan He defends his dissertation entitled "Low Power Architectures for Streaming Applications".

## 24.10.13, 11h, EN 642: Design and Implementation of a high-throughput CABAC Hardware Accelerator for the HEVC Decoder- Philipp Habermann.

HEVC is the new video coding standard of the Joint Collaborative Team on Video Coding. As in its predecessor H.264/AVC, Context-based Adaptive Binary Arithmetic Coding (CABAC) is a throughput bottleneck. Due to strong low-level data dependencies, there is only a very small amount of data-level parallelism that can be exploited by using the SIMD extensions of current computer architectures. A high-level parallelization is possible in HEVC, but not mandatory. That is why another optimization strategy has to be developed that can be used independently of the input video. Attention was paid to throughput improvements during the standardization of HEVC to address this issue. The goal of this thesis is to evaluate the hardware acceleration opportunities for the highly sequential HEVC CABAC by exploiting the throughput improvements. The evaluation is limited to transform coefficient decoding, as it is the most time-consuming part of CABAC. The hardware accelerator is implemented on the Digilent ZedBoard, a development board that contains a 667 MHz ARM Cortex-A9 processor together with a closely coupled FPGA and thereby allows efficient hardware-software co-design. The implemented hardware accelerator processes 70 Mbins/s at 75.36 MHz and achieves an 11× speed-up over software transform coefficient decoding for a typical workload. The hardware accelerator has also been integrated into a complete HEVC software decoder, but due to the current slow hardware-software interface, the overall speed-up is relatively small. However, as the data transfer between hardware and software can be significantly reduced when a full CABAC hardware accelerator is implemented, this is a promising path to pursue in future work.
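The reported throughput and clock frequency can be related by a short calculation. It only uses the two figures quoted above.

```python
bins_per_second = 70e6      # reported throughput: 70 Mbins/s
clock_hz = 75.36e6          # reported accelerator clock: 75.36 MHz

bins_per_cycle = bins_per_second / clock_hz
print(f"{bins_per_cycle:.3f} bins per clock cycle")
```

The accelerator therefore decodes a little under one bin per clock cycle on average.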

## 24.10.13, 10h, EN 642: Design and Implementation of a Hardware Accelerator for HEVC Motion Compensation - Matthias Goebel.

This master thesis focuses on the design and implementation of a motion compensation hardware accelerator for use in HEVC hybrid decoders, i.e. decoders that contain hardware as well as software parts. The motion compensation part of the decoding process is especially suited for such an approach as it is the most time-consuming part of pure software decoders. The hardware accelerator should combine support for high resolutions and frame rates with a very low demand for resources and power. An optimized software decoder compatible with the reference decoder has been used as a starting point. As a platform, the Zynq-7000 All Programmable SoC by Xilinx is used, which combines an ARM Cortex-A9 dual-core CPU running at 667 MHz with flexible programmable logic resources similar to those used in FPGAs. After some background information on the involved topics, the design space is discussed with a special focus on the level of granularity, the degree of parallelization and the memory access. For the granularity, the PU level has been chosen as it offers a good trade-off between performance and complexity. The resulting design is presented in more detail, and a prototype is implemented and validated. For validation, the Foreign Language Interface (FLI) of Mentor Graphics' ModelSim HDL simulator has been used. As an evaluation of the prototype shows promising results, two different memory interfaces (including one using DMA) are added and the complete accelerator is integrated into a Zynq-7000 environment. The necessary modifications to the software decoder for both interfaces are discussed and partially performed. A final evaluation shows an expected frame rate of 4.14 FPS for the complete 1080p decoding process when running the accelerator at 100 MHz.

## 23.10.13: The paper "Considering Quality-of-Service for Resource Reduction using OpenMP" in the MCC13.

The paper "Considering Quality-of-Service for Resource Reduction using OpenMP" has been accepted for oral presentation at the 6th Swedish Workshop on Multicore Computing and will be included in the workshop proceedings. The workshop will be held at Halmstad University in Halmstad, Sweden (November 25-26, 2013).

## 15.10.13: The paper "FPGA-Based Prototype of Nexus++ Task Manager" to appear in MTAGS 2013.

The paper "FPGA-Based Prototype of Nexus++ Task Manager", by Tamer Dallou, Ahmed Elhossini and Ben Juurlink, has been accepted to appear at the 6th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers, which is co-located with Supercomputing/SC 2013, on November 17th, 2013, in Denver, Colorado, USA.

## AES in California: An NVIDIAn returns.

Tuesday, 01. October 2013

As of October 1st, the graduate student Michael Andersch has re-joined AES research. Michael had spent his summer in Santa Clara, California, where he was employed by NVIDIA during a summer internship. As an intern, Michael worked in the GPU Compute Architecture team, building tools and architecture designs to analyze and improve compute application performance on NVIDIA's next-generation GPU designs. Welcome back, Michael!

## Our new employee Philipp Habermann.

Monday, 28. October 2013

We are pleased to welcome Philipp Habermann as a new member of our group. He will contribute to the AES team in research and teaching. Welcome!

## "Recent Advances in Computer Architecture" will take place in room EN 630!

Attention! Room Change! The Course "Recent Advances in Computer Architecture" (0433 L 334) will take place on Tuesdays from 10:00 to 12:00 in room EN 630.

## The Lab exercises for Multicore Architectures will take place in room TEL 206 Li!

Attention! Room Change!

The Lab exercises for the Multicore Architectures course (LV 0433 L 333) will take place on Mondays from 14:00 to 16:00 in room TEL 206 Li.

## The courses Computer Arithmetics, Multicore Architectures and Recent Advances in Computer Architecture start a week later.

The following courses will start a week later (from October 21, 2013):

- Computer Arithmetics: A Circuit Perspective,

- Multicore Architectures and

- Recent Advances in Computer Architecture.

## 10.10.13, 10h, EN 642: Enhancing Cache Organization for Tiled CMP Architectures - Tareq Alawneh.

Many-core processor architectures have become very common, with the leading CPU manufacturers (Intel, AMD, and Tilera) focusing on tiled CMP architectures. Our target system is a tiled CMP architecture consisting of n cores interconnected by a 2D mesh switched network. Each tile has a processor core, private L1 data and instruction caches, a private L2 cache, and a router for on-chip data transfers. Each cache block has a home tile which maintains the directory information for that block; the directory keeps track of the tiles that hold copies of that block. On a miss in both the private L1 and L2 caches, the tile requests the block from its home tile. When such a miss occurs, it is handled according to the specific coherence protocol implementation. The drawback of this design is the possibility of overloading some home tiles with remote requests, which creates a scalability bottleneck. Furthermore, as the processor count increases, the L2 cache miss latency will be dominated by the number of message hops needed to reach the particular cache rather than by the time spent accessing the cache itself. These drawbacks can be mitigated by taking other access patterns of the data into account.

In this study, we analyze this problem and propose ways to alleviate its impact on system performance. One way to improve the performance of the tiled CMP architecture is to access the L2 cache banks of the adjacent tiles to fetch the requested code cache lines before accessing their assigned home tiles. Such a mechanism reduces the remote L2 cache latency, since the requested code cache lines may be fetched from the L2 caches of nearby tiles instead of the L2 caches of their home tiles. Furthermore, the number of accesses to the home tiles is reduced. These two contributions are expected to improve system performance as a consequence of the reduced network utilization and average memory access time (AMAT).

As future work, we propose another way to improve the tiled CMP architectures by migrating hot cache lines closer to requesting tiles.
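The hop-count argument above can be illustrated with a minimal sketch. It assumes row-major tile numbering, an address-interleaved home tile mapping and Manhattan-path routing; none of these details are stated in the abstract, and all function names are hypothetical.

```python
def tile_xy(tile, width):
    """Map a tile id to (x, y) coordinates, assuming row-major numbering."""
    return tile % width, tile // width


def hops(src, dst, width):
    """Manhattan hop count between two tiles in a 2D mesh."""
    sx, sy = tile_xy(src, width)
    dx, dy = tile_xy(dst, width)
    return abs(sx - dx) + abs(sy - dy)


def home_tile(block_addr, n_tiles):
    """Assumed home mapping: cache blocks are interleaved across all tiles."""
    return block_addr % n_tiles


def neighbours(tile, width, height):
    """Adjacent tiles (left, right, up, down) that exist in the mesh."""
    x, y = tile_xy(tile, width)
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [cy * width + cx for cx, cy in candidates
            if 0 <= cx < width and 0 <= cy < height]


# In a 4x4 mesh, a request from corner tile 0 to a block homed at tile 15
# travels 6 hops, while any adjacent tile is only 1 hop away.
print(hops(0, 15, 4), [hops(0, n, 4) for n in neighbours(0, 4, 4)])
```

The gap between the distance to the home tile and the distance to an adjacent tile is what a neighbour-first lookup tries to exploit when the requested line happens to be cached nearby.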

## October, 6-10: Prof. Juurlink to ICCD conference.

From October 6 to October 10 Prof. Juurlink will visit the IEEE International Conference on Computer Design in Asheville, North Carolina, USA.

He is the chairman of the Processor Architecture track and will also chair the session on Efficient Cache Architectures.

## Sept. 30th 2013. A delegation from Hunan University, China visits AES TU-Berlin.

A delegation from the University of Hunan (one of the oldest and most important national universities in China) will visit the AES group of TU-Berlin on September 30th. They will be introduced to the research activities of the AES group and will discuss opportunities for joint research work. The delegation is composed of 7 faculty members from the School of Computer and Communication, led by Professor Renfa Li.

## 13.09.13: AES TU Berlin presents 4k UHD HEVC/H.265 decoding.

The AES group is proud to present its highly efficient 4k Ultra HD capable MPEG-HEVC/H.265 decoder setup. A demo setup has been created with a 65-inch Samsung UHD TV and a custom mini PC based on the 4th generation Intel Core processor. Optimization for the latest generation of processors allows the compact setup to decode UHD faster than 60 fps, even at higher bit depths, with no more than two threads.

## 26.09.2013, 10h, EN 642: A Cost-Effective Kite State Estimator for Reliable Automatic Control of Kites - Johannes Peschel

Airborne Wind Energy (AWE) is a developing technology that uses tethered wings to harvest wind energy and convert it into electrical energy. Most of the AWE concepts presented in this thesis have one common challenge: estimating the position and the orientation of the kite, also called the kite state, especially during highly dynamic flight situations. The focus of this thesis is, first, to investigate whether angular sensors are suitable for obtaining reliable position data and, second, which fusion algorithm can be used to join the data of the angular and Global Navigation Satellite System (GNSS) sensors. The TU Delft prototype is a suitable testing platform for this purpose. The author added angular sensors to the ground station of the TU Delft AWE system that measure the elevation and the horizontal displacement of the tether holding the kite. They are mounted on a modular stainless steel construction, which has low wear and a long lifetime. The author used the tether length and the angular data to obtain a new position. This position was merged with the positions from the two GNSS sensors that were already attached to the kite. The angular sensors were able to measure with a resolution of <0.01°. The elevation and azimuth position of the kite had an error of less than 0.7° as long as the tether force was higher than 2000 N. One of the GNSS sensors provided reliable data during low force phases. A reliable position in all flight conditions could be obtained by using double exponential smoothing prediction to merge both positions. This development enables the implementation of a reliable kite power control system.
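As a rough illustration of the prediction step mentioned above, the following sketch applies double exponential smoothing in Holt's two-parameter form to a single coordinate. The abstract does not say which variant or which parameters the thesis uses, nor how the angular and GNSS positions are combined, so the function name, the default `alpha` and `beta` values and the single-coordinate setup are assumptions.

```python
def double_exponential_smoothing(samples, alpha=0.5, beta=0.5, horizon=1):
    """Predict the value `horizon` steps ahead of the last sample.

    Holt's linear method: a smoothed level plus a smoothed trend.
    alpha and beta are illustrative defaults, not values from the thesis.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a trend")
    level = samples[0]
    trend = samples[1] - samples[0]
    for x in samples[1:]:
        prev_level = level
        # Blend the new measurement with the previous one-step forecast.
        level = alpha * x + (1 - alpha) * (level + trend)
        # Blend the newly observed slope with the previous trend estimate.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend


# A coordinate that rises by one unit per sample is extrapolated linearly.
print(double_exponential_smoothing([0.0, 1.0, 2.0, 3.0, 4.0]))
```

Because the method tracks a trend in addition to a level, it can bridge short intervals in which one sensor is unreliable by extrapolating from recent motion.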

## Sept. 22-27, 2013: Prof. Juurlink to ScalPerf workshop.

Prof. Juurlink has been invited to give a presentation at the ScalPerf (Scalable Approaches to High Performance and High Productivity Computing) workshop which will be held in Bertinoro, Italy from Sept. 22 to Sept. 27, 2013. There he will present his recent article "Amdahl's law for predicting the future of multicores considered harmful". For more information about the workshop see http://www.dei.unipd.it/~versacif/scalperf13/index.html. The article can be accessed via ACM Digital Library http://doi.acm.org/10.1145/2234336.2234338.

## September 15, 2013: "HiPEAC grant: Performance portability for low-power embedded GPUs"

The AES group of TU Berlin has received a collaboration grant from HiPEAC for a three-month visit of Guilherme Calandrini, a PhD student from the University of Alcala in Spain. The visit will focus on performance portability for low-power embedded GPUs using OpenCL. In this collaboration we aim to create a set of OpenCL benchmarks that can be used to compare the performance and power efficiency of different embedded low-power GPUs. The results of this research will be useful for understanding the performance and power implications of optimization strategies for different GPU architectures, and for selecting the most appropriate GPUs based on well-defined quantitative performance and power metrics.

## 11.09.13: Best paper award at the 3rd IEEE 2013 ICCE-Berlin.

Mauricio Alvarez-Mesa, Chi Ching Chi and Ben Juurlink of the AES group of TU Berlin have won a best paper award at the Third IEEE International Conference on Consumer Electronics-Berlin (ICCE-Berlin) for the paper "HEVC Performance and Complexity for 4K Video". The paper was a joint effort between the AES group of TU Berlin and Fraunhofer HHI.

## September 9, 2013: The AES group will host Mr. Hasan Hassan.

The AES group will host Mr. Hasan Hassan, a student at TOBB University of Economics and Technology (http://etu.edu.tr/en), as an intern to work on porting computer vision algorithms to GPUs using OpenCL. The internship is organized within the framework of the EU Erasmus Programme and will take place between 09-09-2013 and 20-12-2013. We aim to port several kernels that are used in various computer vision algorithms and have a high demand for parallel computation to GPUs, which provide a high level of parallel processing.

## September 7, 2013: Prof. Dr. Ben Juurlink in the MuCoCoS-2013

The paper "Topology-aware Equipartitioning with Coscheduling on Multicore Systems" by Jan H. Schönherr, Ben Juurlink and Jan Richling will be presented at the 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS-2013), which will be held on September 7 in Edinburgh, Scotland, UK, in conjunction with the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT 2013).

MuCoCoS-2013 focuses on language level, system software and architectural solutions for performance portability across different architectures and for automated performance tuning.

## Aug 29th. 2013. AES paper in the SPIE Applications of Digital Image Processing Conference.

The paper "HEVC real-time decoding" by Mauricio Alvarez-Mesa, Chi Ching Chi and Ben Juurlink of the AES group of TU-Berlin has been presented at the SPIE Applications of Digital Image Processing Conference that was held in San Diego, USA, from August 25 to August 29, 2013. The paper was a joint effort between the AES group of TU Berlin and Fraunhofer HHI.

## July 2013: "best-in-class-award" to Philipp Habermann.

Recently the "best-in-class-award" was handed out by Prof. Juurlink to Philipp Habermann for the class "Advanced Computer Architecture 2011-2012". The "best-in-class-award" is awarded to the student who achieves the highest grade in Prof. Juurlink's master courses. This year it will also be awarded.

## 14-20 July, 2013: Sohan Lal and Jan Lucas from TU Berlin are going to present two posters at HiPEAC ACACES 2013.

Sohan Lal and Jan Lucas from TU Berlin are going to present two posters at HiPEAC ACACES 2013. The two posters will present some of the most recent research results from the project to the public for the first time.

• The poster “Exploring GPGPUs Workload Characteristics and Power Consumption” by Lal et al. will provide interesting insights into the power consumption of GPU workloads and how they are related to the performance characteristics of the workloads.
• The poster “DART: A GPU architecture exploiting temporal SIMD for divergent workloads” will present first simulation results for DART, a new GPU architecture developed within the LPGPU consortium by Lucas et al.

## 8.07.2013, 13h, EN 642: Implementation and Evaluation of Large Warps in GPUSimPow - Matthias Stroux

Graphics processing units (GPUs) are a special class of parallel processors for massively parallel programs. From additional processors that accelerate graphics-intensive programs, they have developed into general-purpose computing devices for high-performance business and scientific computing. GPUs typically handle branches by serializing the branch paths, which leads to underutilization of their SIMD execution units. Large Warps (LW) is a concept that increases utilization by selecting threads with the same execution path (PC and program state) from larger units, called "Large Warps", into temporary units of execution, the "Sub-Warps". For some programs, LW should therefore lead to a significant increase in SIMD utilization. Theoretical considerations also show that for some programs an increase in IPC, or a decrease in execution cycles, by a factor of more than two should be possible. To test this concept with real programs and in a complete system, where memory latency and network effects can be taken into account and measured, Large Warps was implemented in GPGPU-Sim 3.x, a software simulator for GPUs, and in the new power simulator GPUSimPow to capture power effects. Results for ideally constructed synthetic benchmarks show the expected effects: where the utilization of the SIMD units can be increased, IPC increases too. However, memory effects and the effects of other system parts have to be taken into account as well. For a number of "real-world" benchmarks, the positive effect of Large Warps on performance (IPC) can be confirmed.
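The sub-warp idea can be sketched in a few lines. This is a simplified model that only groups the threads of one large warp by program counter; it ignores lane constraints, register file layout and scheduling, and it is not the GPGPU-Sim implementation from the thesis. All names are hypothetical.

```python
from collections import defaultdict


def form_sub_warps(thread_pcs, simd_width):
    """Pack threads of a large warp that share a PC into sub-warps."""
    by_pc = defaultdict(list)
    for tid, pc in enumerate(thread_pcs):
        by_pc[pc].append(tid)
    sub_warps = []
    for pc in sorted(by_pc):
        tids = by_pc[pc]
        for i in range(0, len(tids), simd_width):
            sub_warps.append((pc, tids[i:i + simd_width]))
    return sub_warps


def simd_utilization(sub_warps, simd_width):
    """Fraction of SIMD lanes doing useful work over all issued sub-warps."""
    active = sum(len(tids) for _, tids in sub_warps)
    return active / (len(sub_warps) * simd_width)


def baseline_issue_slots(thread_pcs, simd_width):
    """Issue slots when each fixed warp serializes its divergent paths."""
    slots = 0
    for i in range(0, len(thread_pcs), simd_width):
        slots += len(set(thread_pcs[i:i + simd_width]))
    return slots


# Eight threads that alternate between two branch targets, SIMD width 4.
pcs = [0x10, 0x20] * 4
sub = form_sub_warps(pcs, 4)
print(baseline_issue_slots(pcs, 4), len(sub), simd_utilization(sub, 4))
```

In this constructed case the two fixed warps need four half-empty issue slots, while regrouping by PC yields two fully occupied sub-warps, which corresponds to the factor-of-two gain the abstract mentions as a theoretical possibility.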

## June 12th 2013: A group of computer engineering students from Universidad del Valle (Colombia) visit AES TU-Berlin

A group of computer engineering students, accompanied by Professor Dr. Maria Trujillo from Universidad del Valle in the Colombian city of Cali, will visit TU-Berlin on June 12th 2013. The German Academic Exchange Service (DAAD) has organized and financed this visit, which will allow the students to get to know the research and teaching activities of the AES-TUB group and will also be the starting point for future research collaborations.

## June 12th 2013, 10 a.m., EN 185: Achronix tech talk

The outline of the talk is:

• Overview of the FPGA market;
• Achronix in the High End FPGA market;
• Achronix Value Proposition versus incumbent high end FPGA vendors;
• INTEL / Achronix partnership : Current 22 nm Tri-gate process product, 14 nm products 2014/201, 10 nm;
• Products available at Achronix in 2nd H 2013;
• SW tools presentation (video);
• HD1000 demo board.

## SoSe 13 - A New Course: Computer Arithmetic: A Circuit Perspective

The advance of modern embedded systems and their high computation capabilities mainly depend on their ability to perform arithmetic operations efficiently. This course is intended to deepen knowledge about the design of embedded arithmetic circuits as well as the scientific background of these circuits. This will help the students to learn more about the design of arithmetic processing units and to gain more practical experience in the implementation of digital systems. The students will increase their experience in the use of hardware description languages to model and implement digital systems. The implementation of these circuits using VHDL/FPGA will be included as well.

## April 21, 2013: AES-Paper at ISPASS 2013.

The paper "Why a Single Chip Causes Massive Power Bills - GPUSimPow: A GPGPU Power Simulator" by Jan Lucas, Sohan Lal, Michael Andersch, Mauricio Alvarez-Mesa and Ben Juurlink has been accepted at the 2013 International Symposium on Performance Analysis of Systems and Software which will be held from April 21-23 in Austin, Texas, US. The paper details much of the work performed by AES' Low-Power GPU group concerning power simulation.

## 12.04.2013, 10h, EN 642: Bachelor Thesis: High-Throughput Communication Interface for the Xilinx XUPV5 Evaluation Platform - Lester Kalms

As processors in computer systems take on more tasks of growing complexity, it makes sense to offload some of these tasks to relieve the processor. Some of them are handled by expansion cards such as graphics cards or sound cards, which nowadays communicate with the rest of the system via PCI-Express. Peripheral cards can of course also support other tasks. A platform for the development of an expansion card is provided, for example, by Xilinx with the XUPV5 evaluation platform [5]. In order to communicate with the card, an interface is needed on both the hardware and the software side of the communication. This can be developed with this card and the help of the Xilinx tools. In times of ever-increasing amounts of data, a correspondingly high data throughput is needed, which is in theory feasible with PCI-Express. This thesis deals with the development of an interface that communicates via PCI-Express and with how to maximize the data throughput. It is intended to support the work of others who want to develop an efficient expansion card. The second chapter introduces PCI-Express and explains the fundamentals: what PCI-Express is, how communication works and what data throughput can be achieved. PCI-Express communicates via packets called "Transaction Layer Packets". The third chapter describes the system that has been developed: how the hardware design works as a whole and in detail, how it has been implemented, how the software system is created and how it works, and especially how the two systems interact with each other. The following chapters cover the practical work. The fourth chapter describes how a running system that satisfies the requirements has been created. The system described in the previous chapter was able to communicate, but there were still some errors in various situations. The chapter explains what has been done to correct these errors, what did not work, and why. The fifth chapter deals with increasing the data throughput and also includes some measurements. For easier handling and measurement, a user application was implemented. In the final chapter the results are discussed, interpreted and compared with the theory. Finally, there is an outlook on methods that can still be tested or that have not yet been tested completely.

## 8-11 April 2013: a 4K H.265/HEVC real-time decoder at NABShow 2013 in Las Vegas

A 4K H.265/HEVC real-time decoder has been presented at the NABShow in Las Vegas, Nevada, USA, during April 8-11, 2013. The demo consisted of a software-based decoder running on a multicore PC connected to an 84-inch 4K TV. It was presented at the Fraunhofer HHI Booth C7843. The real-time decoder has been developed as part of a collaboration between the Fraunhofer Heinrich Hertz Institute (HHI) and the AES group of TU-Berlin. The demo was presented by Benjamin Bross from Fraunhofer HHI and Mauricio Alvarez-Mesa from Fraunhofer HHI and TU-Berlin.

## 14/03/2013, 10h, EN 642: "Migen - a Python toolbox for building complex digital hardware" - Sébastien Bourdeauducq

Despite being faster than schematics entry, hardware design with Verilog and VHDL remains tedious and inefficient for several reasons. The event-driven model introduces issues and manual coding that are unnecessary for synchronous circuits, which represent the lion's share of today's logic designs. Counter-intuitive arithmetic rules result in steeper learning curves and provide a fertile ground for subtle bugs in designs. Finally, support for procedural generation of logic (metaprogramming) through "generate" statements is very limited and restricts the ways code can be made generic, reused and organized.
To address those issues, we have developed the Migen FHDL library that replaces the event-driven paradigm with the notions of combinatorial and synchronous statements, has arithmetic rules that make integers always behave like mathematical integers, and most importantly allows the design's logic to be constructed by a Python program. This last point enables hardware designers to take advantage of the richness of the Python language - object oriented programming, function parameters, generators, operator overloading, libraries, etc. - to build well organized, reusable and elegant designs.
Other Migen libraries are built on FHDL and provide various tools such as a system-on-chip interconnect infrastructure, a dataflow programming system, a more traditional high-level synthesizer that compiles Python routines into state machines with datapaths, and a simulator that allows test benches to be written in Python.
URL:  http://milkymist.org/3/migen.html

## 01/02/2013 - 11 a.m.: Master school - ICT Innovation - information event

Invitation to the information event

On February 1st, 2013 at 11 a.m. an information event for the "European Dual Degree Master in ICT innovation" will take place in room TEL AB (Telefunken tower). We would like to invite all interested students, and especially those who will finish their BSc degree by August 2013.

The dual degree master program "ICT Innovation" will start in the winter term 2013/14. The application deadline is April 15th, 2013.

## 31.01.2013, 10h, EN 642: "Composing Execution Times on Multicore Processors" - J. Reinier van Kampenhout

The use of multicore processors in embedded systems promises to reduce the space, weight and power requirements while offering increased functionality. To enable these benefits however, a runtime environment must be able to execute multiple safety-critical applications in parallel with non-critical applications. An underlying problem in multicores is the use of shared resources, which leads to interference between applications and unpredictable timing behaviour which is not acceptable for critical applications with hard real-time requirements.

In this research we will conceive and implement a concept for the execution of real-time applications on multicore processors with a composable timing behaviour. In our approach we decompose applications into basic blocks whose behaviour is deterministic and can be determined empirically. Using models that capture the essential properties of the HW and SW, we construct a deployment scheme out of these blocks. The result is a system on which multiple mixed-criticality applications are executed in parallel, each of which has a timing behaviour that is composed out of that of its basic blocks. Thus our method guarantees isolation between applications and simplifies worst-case execution analysis, independent of the hypervisor or OS. The usage of resources can furthermore be optimized by allocating any unused resources dynamically to non-critical applications at run time. We will demonstrate the effectiveness of our concept by comparing the variation in execution times to those achieved with purely static scheduling, fixed-priority scheduling and virtualization.

## 21.01.2013: Mr. Tamer Dallou has won a “Best Poster Award” at the HiPEAC 2013.

Mr. Tamer Dallou has won a “Best Poster Award” for our joint poster “Nexus++: A hardware Task Manager for the StarSs Programming Model” at the 8th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC 2013), January 2013, Berlin, Germany.

Abstract:
Recently, several programming models have been proposed that try to ease parallel programming. One of these programming models is StarSs. In StarSs, the programmer has to identify pieces of code that can be executed as tasks, as well as their inputs and outputs. Thereafter, the runtime system (RTS) determines the dependencies between tasks and schedules ready tasks onto worker cores. Previous work has shown, however, that the StarSs RTS may constitute a bottleneck that limits the scalability of the system, and proposed a hardware task manager called Nexus to eliminate this bottleneck. Nexus has several limitations, however. For example, the number of inputs and outputs of each task is limited to a fixed constant and Nexus does not support double buffering. Here we present Nexus++, which addresses these as well as other limitations. Experimental results show that double buffering achieves a speedup of 54×, and that Nexus++ significantly enhances the scalability of applications parallelized using StarSs.
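How a runtime can derive task dependencies from declared inputs and outputs may be sketched as follows. This is a generic software illustration of read-after-write, write-after-write and write-after-read tracking; it is not the StarSs runtime or the Nexus++ hardware design, and the task tuple format and function name are assumptions.

```python
def build_dependencies(tasks):
    """Derive task dependencies from declared inputs and outputs.

    tasks: list of (name, inputs, outputs) tuples in program order.
    Returns a dict mapping each task name to the set of tasks it must wait for.
    """
    last_writer = {}          # address -> task that last wrote it
    readers_since_write = {}  # address -> tasks that read it since that write
    deps = {name: set() for name, _, _ in tasks}
    for name, inputs, outputs in tasks:
        for addr in inputs:
            if addr in last_writer:
                deps[name].add(last_writer[addr])  # read-after-write
        for addr in outputs:
            if addr in last_writer:
                deps[name].add(last_writer[addr])  # write-after-write
            deps[name].update(readers_since_write.get(addr, ()))  # write-after-read
        for addr in inputs:
            readers_since_write.setdefault(addr, set()).add(name)
        for addr in outputs:
            last_writer[addr] = name
            readers_since_write[addr] = set()
        deps[name].discard(name)  # a task never waits for itself
    return deps


tasks = [
    ("A", [], ["x"]),
    ("B", ["x"], ["y"]),
    ("C", ["x"], ["z"]),
    ("D", ["y", "z"], ["x"]),
]
for name, waits_for in build_dependencies(tasks).items():
    print(name, sorted(waits_for))
```

Tasks whose dependency set is empty, or whose dependencies have all completed, are ready to be scheduled onto worker cores; here B and C can run in parallel once A has finished.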

## Jan. 2013 : HiPEAC 2013 in Berlin.

The HiPEAC conference will be held in Berlin from Monday, January 21 to Wednesday, January 23, 2013. The HiPEAC conference is the premier forum for experts in computer architecture, programming models, compilers and operating systems for embedded and general-purpose systems in Europe. In 2013 the general chairs will be Ben Juurlink of TU Berlin and Keshav Pingali of the University of Texas, Austin. Program chairs are André Seznec of INRIA Rennes and Lawrence Rauchwerger of Texas A&M University. Paper selection is performed by the ACM journal TACO. More than 500 people attended the HiPEAC 2012 conference in Paris. Hopefully HiPEAC 2013 will be as successful. For more information, stay tuned at http://www.hipeac.net/conference/berlin.

## 2012: Book "Scalable Parallel Programming Applied to H.264/AVC Decoding"

The book titled "Scalable Parallel Programming Applied to H.264/AVC Decoding", co-authored by Ben Juurlink, Mauricio Alvarez-Mesa, Chi Ching Chi, Arnaldo Azevedo, Cor Meenderinck and Alex Ramirez, has been published by Springer as part of the series SpringerBriefs in Computer Science. The book can be purchased from several internet retailers. More information can be found on the Springer webpage: http://www.springer.com/engineering/signals/book/978-1-4614-2229-7

## Nov. 1, 2012: AES-Paper in the IEEE Transactions on Circuits and Systems for Video Technology

The paper "Parallel Scalability and Efficiency of HEVC Parallelization Approaches" by C.C. Chi, M. Alvarez-Mesa, B. Juurlink, G. Clare, F. Henry, S. Pateux and T. Schierl has been accepted in the IEEE Transactions on Circuits and Systems for Video Technology. The paper is part of a special issue about High Efficiency Video Coding (HEVC) that will appear in December 2012. The paper can now be accessed at IEEE Xplore: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6327343&isnumber=4358651

## Nov. 10, 2012: AES-Paper in the Journal of Signal Processing Systems.

The paper "Parallel HEVC Decoding on Multi- and Many-core Architectures. A Power and Performance Analysis" by C.C. Chi, M. Alvarez-Mesa, J. Lucas, B. Juurlink, and T. Schierl, has been accepted in the Journal of Signal Processing Systems. It will appear soon in a special issue about Design and Implementation of Signal Processing Systems.

## Nov. 12, 2012: "SynZEN: A Hybrid TTA/VLIW Architecture with a Distributed Register File" at NORCHIP 2012

The paper "SynZEN: A Hybrid TTA/VLIW Architecture with a Distributed Register File" by S. Hauser, N. Moser and B. Juurlink has been accepted at NORCHIP - The Nordic Microelectronics event 2012, which will be held in Copenhagen, Denmark, on Nov. 12 - Nov. 13, 2012. More information about NORCHIP 2012 can be found at www.norchip.org.

## Oct 23rd, 11h, KIT: High Efficiency Video Coding on Multi- and Many-core Architectures by M. Alvarez Mesa

Dr. Mauricio Alvarez-Mesa will give an invited talk at Karlsruhe Institute of Technology titled "High Efficiency Video Coding on Multi- and Many-core Architectures" in which he will present the latest results of the AES research on HEVC decoding on parallel architectures. The talk will be held on October 23rd at 11:00 am at Karlsruher Institut für Technologie (KIT), Institut für Prozessdatenverarbeitung und Elektronik (IPE), Karlsruhe, Germany.

## 11. Oct 2012- 10h-EN 642: Scalable Runtime and OS Abstractions for Mesh-Based MultiCores (Prof. Frank Mueller)

Current trends in microprocessors are to steadily increase the number of cores. As the core count increases, the network-on-chip (NoC) topology has changed from buses over rings and fully connected meshes to 2D meshes.

This work contributes NoCMsg, a low-level message passing abstraction over NoCs. NoCMsg is specifically designed for large core counts in 2D meshes. Its design ensures deadlock free messaging for wormhole Manhattan-path routing over the NoC. Experimental results on the TilePro hardware platform show that NoCMsg can significantly reduce communication times when compared with other NoC-based message approaches. They further demonstrate the potential of NoC messaging to outperform shared memory abstractions, such as OpenMP, as core counts and inter-process communication increase.

This work further explores the benefits of novel runtime and operating systems abstractions for large scale multicores. On top of NoCMsg, a distributed OS abstraction is promoted instead of the traditional shared memory view on a chip. This distributed kernel features a pico-kernel per core. Sets of pico-kernels are controlled by micro-kernels, which are topologically centered within a set of cores. Cooperatively, micro-kernels comprise the overall operating system in a peer-to-peer fashion.

Biography: Frank Mueller is a Professor in Computer Science and a member of multiple research centers at North Carolina State University. Previously, he held positions at Lawrence Livermore National Laboratory and Humboldt University Berlin, Germany. He received his Ph.D. from Florida State University in 1994. He has published papers in the areas of parallel and distributed systems, embedded and real-time systems and compilers. He is a member of ACM SIGPLAN, ACM SIGBED and a senior member of the ACM and IEEE Computer Societies as well as an ACM Distinguished Scientist. He is a recipient of an NSF Career Award, an IBM Faculty Award, a Google Research Award and a Fellowship from the Humboldt Foundation.

## 11.Oct 2012: Courses in WS2012/13

The course information for the current semester is online. We would particularly like to highlight the new course AES for bachelor students.

## Sept 30 - Oct 3, 2012: "Improving the Parallelization Efficiency of HEVC Decoding" at ICIP 2012

The paper "Improving the Parallelization Efficiency of HEVC Decoding" by C. C. Chi, M. Alvarez-Mesa, B. Juurlink, V. George and T. Schierl has been accepted at the 2012 IEEE International Conference on Image Processing (ICIP), which will be held in Orlando, Florida, USA, on Sept. 30 - Oct. 3, 2012. This paper is the second of a collaboration between the AES group and the Multimedia Communications Group of the Fraunhofer HHI Institute on the topic of parallel processing for HEVC. More information about ICIP-2012 can be found at http://icip2012.com.

## Sept 10, 2012: "Hardware-Based Task Dependency Resolution for the StarSs Programming Model" at SRMPDS'12

The paper "Hardware-Based Task Dependency Resolution for the StarSs Programming Model" by Tamer Dallou and Ben Juurlink has been accepted at the "SRMPDS'12 - Eighth International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems", which will be held in conjunction with "ICPP'12 - The 2012 International Conference on Parallel Processing" in Pittsburgh, PA on September 10, 2012.
This paper is a result of the research conducted at AES as part of the ENCORE project. More information on SRMPDS can be found at:
http://www.mcs.anl.gov/~kettimut/srmpds/

## Sept 5-8, 2012: "A Novel Predictor-based Power-Saving Policy for DRAM Memories" at the 15th EUROMICRO Conference on Digital System Design (DSD)

The paper "A Novel Predictor-based Power-Saving Policy for DRAM Memories" by Gervin Thomas, Karthik Chandrasekar, Benny Akesson, Ben Juurlink and Kees Goossens has been accepted at the 15th EUROMICRO Conference on Digital System Design (DSD), Cesme, Izmir, Turkey on September 5th - September 8th, 2012. This paper is a collaboration between the AES group (TU-Berlin) and Electronic Systems group (TU Eindhoven). More information about DSD-2012 can be found at http://www.univ-valenciennes.fr/congres/dsd2012/.

## August 27, 2012: "An Optimized Parallel IDCT on Graphics Processing Units" at HeteroPar'2012

The paper "An Optimized Parallel IDCT on Graphics Processing Units" by Biao Wang, Mauricio Alvarez-Mesa, Chi Ching Chi, and Ben Juurlink has been accepted at the 2012 International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar'2012), which will be held in Rhodes Island, Greece on August 27, 2012. The paper presents work on offloading the H.264 IDCT kernel to GPUs, which has been conducted at AES as part of the LPGPU project. More information on HeteroPar can be found at http://pm.bsc.es/heteropar12/.

## July 31, 2012: AES has set up a testbed to accurately measure GPU power consumption.

AES has set up a testbed to accurately measure GPU power consumption. This testbed is being used to evaluate power reduction techniques on available GPUs. It will also be used to validate the power modeling of GPUSimPow, the GPU power simulator developed within the LPGPU project. Its high bandwidth and high sampling speeds enable it to accurately measure short, sub-ms power events.
The measurement software developed at AES allows developers to pinpoint power consumption down to the individual kernel.

## 16-19 July 2012: "Using OpenMP Superscalar for Parallelization of Embedded and Consumer Applications" at SAMOS XII

The paper "Using OpenMP Superscalar for Parallelization of Embedded and Consumer Applications" by M. Andersch, C.C. Chi and Ben Juurlink has been accepted at the 2012 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), which will be held in Samos, Greece on July 16-19, 2012. The paper presents the latest research on the OpenMP Superscalar programming model, which has been conducted at AES as part of the ENCORE project. More information on SAMOS can be found at http://samos.et.tudelft.nl/samos_xii/html/.

## July 11, 2012: "Nexus++: A hardware Task Manager for the StarSs Programming Model" at ACACES'12

The poster "Nexus++: A hardware Task Manager for the StarSs Programming Model" by Tamer Dallou and Ben Juurlink has been presented at the "ACACES'12 - Eighth International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems", which was held in Fiuggi, Italy, on 8-14 July, 2012.
This poster presents some of the results of the research conducted at AES as part of the
http://www.hipeac.net/summerschool/

## 8-14 July 2012: Mr. Tamer Dallou attends ACACES 2012.

Mr. Tamer Dallou was awarded a HiPEAC grant to attend the Eighth International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES 2012), held July 8-14, 2012 in Fiuggi, Italy.

The "HiPEAC Summer School" is a one-week summer school for computer architects and compiler builders working in the field of high-performance computer architecture and compilation for embedded systems. The school aims at the dissemination of advanced scientific knowledge and the promotion of international contacts among scientists from academia and industry.

## AES group purchases TILE-Gx36 many-core.

The AES group of TU Berlin has purchased a state-of-the-art TILE-Gx36 many-core with 36 64-bit processor cores (tiles) from Tilera (tilera.com). Researchers and students of AES will soon be able to work on this processor. For more information about the TILE-Gx processor family, see http://tilera.com/products/processors/TILE-Gx_Family.

## May 2012: Article in ACM SIGARCH Computer Architecture News.

Ben Juurlink and his PhD graduate Cor Meenderinck have published an article entitled "Amdahl's Law for Predicting the Future of Multicores Considered Harmful" in the current (May 2012) issue of Computer Architecture News, which is published by the ACM Special Interest Group on Computer Architecture (SIGARCH) [http://www.sigarch.org/]. In the article they consider how the predictions in the influential paper of Hill and Marty [1] change when Gustafson's law is assumed instead of Amdahl's law. They also propose a different scaling equation, called the Generalized Scaled Speedup Equation (GSSE), that encompasses both Amdahl's and Gustafson's laws.

[1] Mark D. Hill, Michael R. Marty: Amdahl's Law in the Multicore Era. IEEE Computer 41(7): 33-38 (2008)
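To make the contrast between the two laws concrete, the following sketch evaluates the textbook forms of Amdahl's law (fixed problem size) and Gustafson's law (problem size scaled with the core count). It does not reproduce the GSSE from the article; the function names and the parallel fraction of 0.95 are assumptions chosen for illustration.

```python
def amdahl_speedup(f, n):
    """Amdahl's law: f is the parallelizable fraction of a fixed workload,
    n is the number of cores."""
    return 1.0 / ((1.0 - f) + f / n)


def gustafson_speedup(f, n):
    """Gustafson's law: f is the parallel fraction of the execution time
    measured on n cores, with the workload scaled to the machine."""
    return (1.0 - f) + f * n


# Compare both predictions for an assumed parallel fraction of 0.95.
for n in (4, 16, 64, 256):
    print(n, round(amdahl_speedup(0.95, n), 2), round(gustafson_speedup(0.95, n), 2))
```

Under Amdahl's law the speedup is bounded by 1 / (1 - f) no matter how many cores are added, whereas under Gustafson's law it grows linearly with the core count, which is why the choice of law changes predictions about future multicores.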

## HiPEAC '13: Call for papers

The 8th HiPEAC conference will take place in Berlin, Germany from Monday, January 21 to Wednesday, January 23, 2013.

For submission details, please refer to http://mc.manuscriptcentral.com/taco.

• Workshops/tutorials: June 1, 2012
• Papers: June 18, 2012
• Posters: October 15, 2012
• Early Registration Deadline: December 22, 2012

## 16 May 2012: Prof. Dr. Ben Juurlink at Map2MPSoC/SCOPES.

Prof. Ben Juurlink will give an invited keynote at the 5th Workshop on Mapping of Applications to MPSoCs and 15th International Workshop on Software and Compilers for Embedded Systems, which will be held May 15-16 in the beautiful Schloss Rheinfels hotel at St. Goar, Germany (http://www.scopesconf.org/scopes-12/).

## 10 May 2012, 10h - Room EN 642: REFLEX (Richard Weickelt)

REFLEX is a framework for deeply embedded control systems. It is based upon the event-flow model, which greatly supports component-centric development of concurrent applications. In combination with multiple scheduling directives, interrupt handling and power management facilities, developers can create applications that are both deadlock-free and fully predictable.

The library is implemented in C++ and benefits from its powerful language features. Only a few parts are platform-dependent, and these can be ported to new architectures with very little effort. A standard compiler like g++ is the only requirement.

REFLEX was developed at the TU Cottbus and is released under the BSD license. In this meeting you will get a brief overview of the framework and its features. After a case study about a real-world product, future research challenges will be discussed.

## 12.04.2012- 10h: Online satellite image processing (Kristian Manthey)

Kristian Manthey will give a talk at our research meeting on 12.04.2012 at 10h on the topic: Online satellite image processing (Realtime Image compression on reconfigurable Hardware). Room: EN 642.

Abstract: Optical systems in spaceborne missions face challenging requirements. In recent years, both the spatial and the spectral resolution of image data have increased, resulting in a tremendous increase in data rate. There are also requirements on image quality and constraints imposed by the environment in which the system is used. An optical system for spaceborne applications must have very high reliability, low power consumption, and low weight. The system must be radiation tolerant and able to operate in vacuum and over a wide temperature range. As the ground sample distance (GSD) decreases or the swath increases, the amount of data increases significantly. Because the transmission bandwidth to the ground station is limited, the data must be compressed. Depending on the requirements of the mission, lossless or lossy compression schemes can be used. Image compression itself is based on the removal of redundant information in the image, such as spatial or statistical redundancy, or on the removal of information not needed in further processing. Image compression architectures consist of spatial decorrelation to remove spatial redundancy, followed by quantization in the case of lossy compression, and finally entropy coding to remove statistical redundancy. Spatial decorrelation in typical space missions is done by prediction (DPCM), discrete cosine transform (DCT), or discrete wavelet transform (DWT). To achieve the best compression results, inter-band decorrelation techniques are necessary, because image data is correlated between bands, or between multispectral (MS) sensors and a sensor that is sensitive in all MS channels.

At DLR, it is planned to develop a satellite camera that performs all tasks (image acquisition, pre-processing, compression, storage, data formatting, and communication with the ground station) on a single multi-chip module (MCM). In a first step, the image compression should be done directly on the image acquisition module. The goal of this thesis is to investigate scenarios in which the ground station interactively requests and decompresses the image data, and to develop a high-speed image compression system on the image acquisition module.
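The role of prediction-based spatial decorrelation (DPCM) ahead of entropy coding can be sketched with a toy example. The code below is an assumed, simplified illustration rather than the system described in the talk: it predicts each pixel from its left neighbour, keeps the prediction error, and compares the empirical entropy before and after. The helper names and the sample row are hypothetical.

```python
import math
from collections import Counter


def dpcm_residuals(row):
    """Predict each pixel from its left neighbour and keep only the error."""
    prev = 0
    residuals = []
    for pixel in row:
        residuals.append(pixel - prev)
        prev = pixel
    return residuals


def dpcm_reconstruct(residuals):
    """Invert dpcm_residuals by accumulating the prediction errors."""
    prev = 0
    row = []
    for r in residuals:
        prev += r
        row.append(prev)
    return row


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


row = [100, 101, 102, 103, 104, 105, 106, 107]  # a smooth, made-up image row
res = dpcm_residuals(row)
print(res)
print(entropy_bits(row), entropy_bits(res))
```

On smooth data the residuals cluster around a few small values, so an entropy coder needs fewer bits per symbol, while `dpcm_reconstruct` shows that without quantization the scheme remains lossless.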

## 25-29 March 2012 : M. Alvarez Mesa presents a paper at ICASSP-2012 in Kyoto, Japan

The paper "Parallel video decoding on the emerging HEVC standard" by M. Alvarez-Mesa, C. C. Chi, B. Juurlink, V. George and T. Schierl has been accepted at the 37th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), which will be held in Kyoto, Japan, on March 25 - 30, 2012. The ICASSP meeting is one of the largest technical conferences focused on signal processing and its applications. The paper, which is the result of a collaboration between the AES group and the Multimedia Communications Group of the Fraunhofer HHI Institute, will be presented by Mauricio Alvarez-Mesa at the session "Parallel and embedded signal processing systems". More information about ICASSP-2012 can be found at http://www.icassp2012.com.

## 26-27 March 2012: Prof. Dr. Ben Juurlink and Sean Halle present their progress in the LPGPU project in Cambridge

Prof. Dr. Ben Juurlink and Sean Halle are going to Cambridge for the first LPGPU face-to-face meeting on March 26 and 27. They will discuss interactions between the work packages and the low-power industry space, and tackle simulator questions. Each participant is going to present their progress in the first year of LPGPU in preparation for the first-year review.

## 25-29 Feb. 2012 : Michael Andersch presents a poster at PPoPP in New Orleans.

The paper "Programming Parallel Embedded and Consumer Applications in OpenMP Superscalar" by Michael Andersch, Chi Ching Chi, and Ben Juurlink was accepted as a poster presentation at the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). The student Michael Andersch will present the poster in New Orleans from February 25 to February 29, 2012. For more information about the PPoPP conference, see http://dynopt.org/ppopp-2012/.

## 1.01.12 - 15.02.12: EIT ICT Master School launches

The application period for the new Master School runs from January 1 to February 15, 2012. Further information is available at eitictlabs.masterschool.eu

## 10.01.2011 -16h: A System-Level Approach to Parallelism (Sean Halle)

Talk announcement: A System-Level Approach to Parallelism (Sean Halle)
Tuesday, January 11, 2011 at 16h in E-N 360.

## 06.07.2010 - 15h: Hardware Task Management Support for Task-Based Programming Models: the Nexus System (M.Sc. Cor Meenderinck)

Talk announcement: Hardware Task Management Support for Task-Based Programming Models: the Nexus System (M.Sc. Cor Meenderinck)
Wednesday, July 14, 2010 at 15h in FR3043

## 31.03.2010: Courses offered in SS2010

The courses offered by our group are listed in the Studium und Lehre (Study and Teaching) section. We would particularly like to point out the master's module "Advanced Computer Architectures" for computer science and computer engineering students, which is being offered for the first time this semester.

## 8.03.10- 10h : New architectures for the final scaling of the CMOS world (Professor Luigi Carro)

Talk announcement: New architectures for the final scaling of the CMOS world (Professor Luigi Carro). Monday, 08.03.2010 at 10h in FR5516.

## 17.02.2010 - 10h: Evaluation of Parallel H.264 Decoding Strategies on the Cell Broadband Engine (Mr. Chi Ching Chi)

Talk announcement: Evaluation of Parallel H.264 Decoding Strategies on the Cell Broadband Engine (Mr. Chi Ching Chi). Wednesday, 17.02.2010 at 10h in FR 3043.

## 12.01.2010: Oral exam in TechGI2 (second resit)

Starting in SS 2010, the module Technische Grundlagen der Informatik 2 (TechGI2) will be taken over by the new head of the Architektur eingebetteter Systeme (AES) group, Prof. Juurlink. He will make some changes to how the content prescribed in the module description is taught, and these changes will also be reflected in the exam questions.
The previous lecturer of the module, Mr. Flik, loses his authorization to examine at the end of WS 2009/10, after which oral exams on the previous content will no longer be possible.
For those currently interested in such an oral exam, Mr. Flik offers exam dates until mid-March 2010. The exam days will be set once the first exam requests have been received (flik(at)cs.tu-berlin.de). Requests should state the degree program, the matriculation number, and the earliest preferred date.
The actual exam date is assigned only after the exam registration required by the examination office has been submitted. This registration must be submitted at least 7 days before the exam date (at the AES or RT secretariat).

## 27.11.2009: Professor Dr. Ben Juurlink accepts appointment.

Professor Dr. Ben Juurlink, professor at Delft University of Technology, the Netherlands, has accepted the appointment to the W3 professorship for the group Rechnerarchitektur – Architektur eingebetteter Systeme (Computer Architecture – Architecture of Embedded Systems).