# **Activity Report 2011** ## **Team CAMUS** # Compilation pour les Architectures MUlti-coeurS IN COLLABORATION WITH: Laboratoire des sciences de l'image, de l'informatique et de la télédétection (L.S.I.I.T) RESEARCH CENTER Nancy - Grand Est THEME **Architecture and Compiling** ## **Table of contents** | 1. | Members | | |----|-----------------------------------------------------------------------|----| | 2. | Overall Objectives | | | | 2.1. Overall Objectives | 1 | | | 2.2. Highlights | 1 | | 3. | Scientific Foundations | | | | 3.1. Research directions | 2 | | | 3.2. Static parallelization and optimization | 3 | | | 3.2.1. State of the art | 3 | | | 3.2.2. Adapting parallelization to multicore architecture | 4 | | | 3.2.3. Expressing many potential parallelisms | 4 | | | 3.3. Profiling and execution behavior modeling | 5 | | | 3.3.1. Selective profiling and interaction with the compiler | 5 | | | 3.3.2. Profiling and dynamic optimization | 6 | | | 3.3.3. Run time program modeling | 6 | | | 3.4. Dynamic parallelization and optimization, virtual machine | 7 | | | 3.4.1. State of the art | 7 | | | 3.4.2. General objective: building a virtual machine | 9 | | | 3.4.3. Adaptation of the intermediate code to the target architecture | 9 | | | 3.4.4. High level parallelization and native code creation | 10 | | | 3.4.5. Low level parallelization | 10 | | | 3.4.6. Distribution, execution and profiling | 10 | | | 3.4.7. Re-parallelization, thread mutation or rollback | 10 | | | 3.5. Proof of program transformations for multicores | 11 | | | 3.5.1. State of the art | 11 | | | 3.5.1.1. Certification of low-level codes. | 11 | | | 3.5.1.2. Certification of a compiler. | 11 | | | 3.5.1.3. Semantics of directives. | 11 | | | 3.5.1.4. Definition of a parallel programming model. | 12 | | | 3.5.1.5. Programming models for multicore architectures. | 12 | | | 3.5.1.6. Refinement of programs. | 12 | | | 3.5.2. Main objective: formal proof of analyses and transformations | 13 | | | 3.5.3. Proof of transformations in the polyhedral model | 13 | | | 3.5.4. Validation under hypothesis | 13 | | | 3.5.5. Rejecting incorrect parallelizations | 14 | | 4. | | | | 5. | | | | | 5.1. PolyLib | 14 | | | 5.2. ZPolyTrans | 14 | | | 5.3. NLR | 15 | | | 5.4. Dynamic version selector | 15 | | | 5.5. Binary files decompiler | 15 | | | 5.6. Dynamic dependency analyser | 16 | | | 5.7. VMAD software and LLVM | 16 | | | 5.8. Polyhedral prover | 17 | | 6. | | | | | 6.1. VMAD and LLVM | 17 | | | 6.2. Dynamic version selector | 17 | | | 6.3. Binary parallelization | 18 | | | 6.4. Modeling the dynamic behavior of executable programs | 18 | | | 6.5. Dynamic dependence analysis | 19 | |----|----------------------------------------------------------------|----| | | 6.6. Dealing with arithmetic overflows in the polyhedral model | 19 | | 7. | Partnerships and Cooperations | | | | 7.1. National Initiatives | 21 | | | 7.2. International Initiatives | 21 | | | 7.2.1. INRIA Associate Teams | 21 | | | 7.2.2. INRIA International Partners | 21 | | | 7.2.3. Visits of International Scientists | 22 | | | 7.2.3.1. Visits | 22 | | | 7.2.3.2. Internships | 22 | | | 7.2.4. Participation In International Programs | 22 | | 8. | Dissemination | | | | 8.1. Animation of the scientific community | 22 | | | 8.2. Teaching | 23 | | 9. | Bibliography | | **Keywords:** Compiling, Embedded Systems, Hardware Accelerators, Proofs Of Programs, Formal Methods, Processors ## 1. Members #### **Faculty Members** Philippe Clauss [Team leader, Professor, Université de Strasbourg, HdR] Éric Violard [Associate Professor, Université de Strasbourg, HdR] Vincent Loechner [Associate Professor, Université de Strasbourg] Alain Ketterlin [Associate Professor, Université de Strasbourg] Julien Narboux [Associate Professor, Université de Strasbourg] Nicolas Magaud [Associate Professor, Université de Strasbourg] #### **PhD Students** Benoît Pradelle [Université de Strasbourg] Alexandra Jimborean [Université de Strasbourg] Jean-François Dollinger [Université de Strasbourg] ## 2. Overall Objectives ## 2.1. Overall Objectives The CAMUS team is focusing on developping, adapting and extending automatic parallelizing and optimizing techniques, as well as proof and certification methods, for the efficient use of current and future multicore processors. The team's research activities are organized into five main issues that are closely related to reach the following objectives: performance, correction and productivity. These issues are: static parallelization and optimization of programs (where all statically detected parallelisms are expressed as well as all "hypothetical" parallelisms which would be eventually taken advantage of at runtime), profiling and execution behavior modeling (where expressive representation models of the program execution behavior will be used as engines for dynamic parallelizing processes), dynamic parallelization and optimization of programs (such transformation processes running inside a virtual machine), object-oriented programming and compiling for multicores (where object parallelism, expressed or detected, has to result in efficient runs), and finally program transformations proof (where the correction of many static and dynamic program transformations has to be ensured). ## 2.2. Highlights - The Associate Team *Ancome*, regrouping Camus and their Argentinian partners from the LaFhis team of the University of Buenos Aires, has started officially its activities in January 2011. - Philippe Clauss, Vincent Loechner and Diego Garbervetsky (Associate Team Ancome), published a chapter in the book *Energy-Aware Memory Management for Embedded Systems*, CRC Press 2012, on the paramatric estimation of memory consumption of programs. - Benoît Pradelle defended his PhD thesis December the 20th 2011 at the University of Strasbourg. He presented his results on dynamic selection of parallel code versions, automatic parallelization of binary codes and polyhedral dynamic and speculative parallelization. His jury was composed by Albert Cohen (reviewer), from INRIA, Sanjay Rajopadhye (reviewer), Colorado State University, USA, John Cavazos (examiner), University of Delaware, USA, Philippe Clauss, Alain Ketterlin and Vincent Loechner (advisors). • Alexandra Jimborean received the best poster award at the international conference Code Generation and Optimization (CGO) that was held in Chamonix, France, in April 2011. ## 3. Scientific Foundations ## 3.1. Research directions The various objectives we are expecting to reach are directly related to the search of adequacy between the sofware and the new multicore processors evolution. They also correspond to the main research directions suggested by Hall, Padua and Pingali in [44]. Performance, correction and productivity must be the users' perceived effects. They will be the consequences of research works dealing with the following issues: - Issue 1: Static parallelization and optimization - Issue 2: Profiling and execution behavior modeling - Issue 3: Dynamic program parallelization and optimization, virtual machine - Issue 4: Object-oriented programming and compiling for multicores - Issue 5: Proof of program transformations for multicores Efficient and correct applications development for multicore processors needs stepping in every application development phase, from the initial conception to the final run. Upstream, all potential parallelism of the application has to be exhibited. Here static analysis and transformation approaches (issue 1) must be processed, resulting in a *multi-parallel* intermediate code advising the running virtual machine about all the parallelism that can be taken advantage of. However the compiler does not have much knowledge about the execution environment. It obviously knows the instruction set, it can be aware of the number of available cores, but it does not know the effective available resources at any time during the execution (memory, number of free cores, etc.). That is the reason why a "virtual machine" mechanism will have to adapt the application to the resources (issue 3). Moreover the compiler will be able to take advantage only of a part of the parallelism induced by the application. Indeed some program information (variables values, accessed memory adresses, etc.) being available only at runtime, another part of the available parallelism will have to be generated on-the-fly during the execution, here also, thanks to a dynamic mechanism. This on-the-fly parallelism extraction will be performed using speculative behavior models (issue 2), such models allowing to generate speculative parallel code (issue 3). Between our behavior modeling objectives, we can add the behavior monitoring, or profiling, of a program version. Indeed current and future architectures complexity avoids assuming an optimal behavior regarding a given program version. A monitoring process will allow to select on-the-fly the best parallelization. These different parallelizing steps are schematized on figure 1. The more and more widespread usage of object-oriented approaches and languages emphasizes the need for specific multicore programming tools. The object and method formalism implies specific execution schemes that translate in the final binary by quite distant elementary schemes. Hence the execution behavior control is far more difficult. Analysis and optimization, either static or dynamic, must take into account from the outset this distortion between object-oriented specification and final binary code: how can object or method parallelization be translated (issue 4). Our project lies on the conception of a production chain for efficient execution of an application on a multicore architecture. Each link of this chain has to be formally verified in order to ensure correction as well as efficiency. More precisely, it has to be ensured that the compiler produces a correct intermediate code, and that the virtual machine actually performs the parallel execution semantically equivalent to the source code: every transformation applied to the application, either statically by the compiler or dynamically by the virtual machine, must preserve the initial semantics. They must be proved formally (issue 5). Figure 1. Automatic parallelizing steps for multicore architectures In the following, those different issues are detailed while forming our global and long term vision of what has to be done. ## 3.2. Static parallelization and optimization Participants: Vincent Loechner, Philippe Clauss, Éric Violard, Alexandra Jimborean. Static optimizations, from source code at compile time, benefit from two decades of research in automatic parallelization: many works address the parallelization of loop nests accessing multi-dimensional arrays, and these works are now mature enough to generate efficient parallel code [27]. Low-level optimizations, in the assembly code generated by the compiler, have also been extensively dealt for single-core and require few adaptations to support multicore architectures. Concerning multicore specific parallelization, we propose to explore two research directions to take full advantage of these architectures. They are described below. #### 3.2.1. State of the art Upstream, an easy interprocedural dependence analysis allows to handle complete programs (but recursivity: recursive functions must be transformed into iterative functions). Concerning iterative control we will use the polyhedral model, a formalism developed these last two decades, which allows to represent the execution of a loop nest by scanning a polytope. When compiling an application, if it contains loop nests with affine bounds accessing scalars or arrays accessed using affine functions, the polyhedral model allows to: - compute the dependence graph, which describes the order in which the dependent instructions must be executed [35]; - generate a schedule, which extracts some parallelism from the dependence graph [36], [37]; - generate an allocation, which assigns a processor (or a core) to a set of iterations of the loop nest to be scanned. This last allocation step needs a thorough knowledge of the target architecture, as many crucial choices will result in performance hazards: for example, the volume and flow of inter-processor communications and synchronization; the data locality and the effects of the TLB (Translation Lookaside Buffer) and the various cache levels and distributions; or the register allocation optimizations. There are many techniques to control these parameters, and each architecture needs specific choices, of a valid schedule, of a parallel loop iterations distribution (bloc-, cyclic-, or tiled), of a loop-unrolling factor, as well as a memory data layout and a prefetch strategy (when available). They require powerfull mathematical tools, such as counting the number of integer points contained in a parametric polytope. Our own contributions in this area are significant. Concerning schedule and data placement, we proposed new advances in minimizing the number of communications for parallel architectures [57] and in cache access optimizations [56] [8]. We also proposed essential advances in parametric polytope manipulation [9], [5], developed the first algorithm to count integer points in a parametric polytope as an Ehrhart polynomial [3], and proposed successive improvements of this algorithm [10] [68]. We implemented these results in the free software *PolyLib*, utilized by many researchers around the world. ## 3.2.2. Adapting parallelization to multicore architecture The first research direction to be explored is multicore specific efficient optimizations. Indeed, multicore architectures need specific optimizations, or we will get underlinear accelerations, or even decelerations. Multicore architectures may have the following properties: specific memory hierarchy, with distributed low-level cache and (possibly semi-) shared high level caches; software-controlled memory hierarchies (memory hints, local stores or scratchpads for example); optimized access to contiguous memory addresses or to separate memory banks; SIMD or vectorial execution in groups of cores, and synchronous execution; higher register allocation pressure when several threads use the same hardware (as in GPGPUs for example); etc. A schedule and an allocation must be chosen wisely in order to obtain good performances. On NVIDIA GPG-PUs, using the CUDA language, Baskaran et al. [26] obtained interesting results that have been implemented in their PLuTo compiler framework. However, they are based on many empirical and imprecise techniques, and require simulations to fine-tune the optimizations: they can be improved. Memory hierarchy efficient control is a cornerstone of tomorrow's multicore architectures performance. Compiler-optimizers have to evolve to meet this requirement. Simulation and (partial-) profiling may however remain necessary in some cases, when static analysis reaches its intrinsic limits: when the execution of a program depends on dynamic parameters, when it uses complex pointer arithmetic, or when it performs indirect array accesses for example (as is often the case in *while* loops, out of the scope of the classical polyhedral model). In these cases, the compiler should rely on the profiler, and generate a code that interacts with the dynamic optimizer. This is the link with issues 2 and 3 of this research project. #### 3.2.3. Expressing many potential parallelisms The dynamic optimizer (issue 3) must be able to exploit various parallel codes to compare them and the best one to choose, possibly swapping from a code to another during execution. The compiler must therefore generate different potentially efficient versions of a code, depending on fixed parameters such as the schedule or the data layout, and dynamic parameters such as the tile size or the unrolling factor. The compiler then generates many variants of *effective* parallelism, formally proved by the static analyzer. It may also generate variants of code that have not been formally validated, due to the analyzer limits, and that have to be checked during execution by the dynamic optimizer: *hypothetical* parallelism. Hypothetical parallelism could be expressed as a piece of code, valid under certain conditions. Effective and hypothetical parallelisms are called *potential parallelism*. The variants of potential parallelism will be expressed in an intermediate language that has to be discovered. Using compiler directives is an interesting way to define this intermediate language. Among the usual directives, we distinguish schedule directives for shared memory architectures (such as the OpenMP¹ parallel directive), and placement directives for distributed memory architectures (for example the HPF² ALIGN directive). These two types of directives are conjointly necessary to take full profit of multicore architectures. However, we have to study their complementarity and solve the interdependence or conflict that may arise between them. Moreover, new directives should allow to control data transfers between different levels of the memory hierarchy. <sup>&</sup>lt;sup>1</sup>http://www.openmp.org <sup>&</sup>lt;sup>2</sup>http://hpff.rice.edu We are convinced that the definition of such a language is required in the next advances in compilation for multicore architectures, and there does not exist such an ambitious project to our knowledge. The OpenCL project<sup>3</sup>, presented as an general-purpose and efficient multicore programming environment, is too low-level to be exploitable. We propose to define a new high level language based on compilation directives, that could be used by the skilled programmer or automatically generated by a compiler-optimizer (like OpenMP, recently integrated in the *gcc* compiler suite). ## 3.3. Profiling and execution behavior modeling Participants: Alain Ketterlin, Philippe Clauss, Benoît Pradelle. The increasing complexity of programs and hardware architectures makes it ever harder to characterize beforehand a given program's run time behavior. The sophistication of current compilers and the variety of transformations they are able to apply cannot hide their intrinsic limitations. As new abstractions like transactional memories appear, the dynamic behavior of a program strongly conditions its observed performance. All these reasons explain why empirical studies of sequential and parallel program executions have been considered increasingly relevant. Such studies aim at characterizing various facets of one or several program runs, e.g., memory behavior, execution phases, etc. In some cases, such studies characterize more the compiler than the program itself. These works are of tremendous importance to highlight all aspects that escape static analysis, even though their results may have a narrow scope, due to the possible incompleteness of their input data sets. ## 3.3.1. Selective profiling and interaction with the compiler In its simplest form, studying a given program's run time behavior consists in collecting and aggregating statistics, *e.g.*, counting how many times routines or basic blocks are executed, or counting the number of cache misses during a certain portion of the execution. In some cases, data can be collected about more abstract events, like the garbage-collector frequency or the number and sizes of sent and received messages. Such measures are relatively easy to obtain, are frequently used to quantify the benefits of some optimization, and may suggest some way to improve performance. These techniques are now well-known, but mostly for sequential programs. These global studies have often been complemented by local, targeted techniques focused on some program portions, *e.g.*, where static techniques remain inconclusive for some fixed duration. These usages of profiling are usually strongly related to the optimization they complement, and are set up either by the compiler or by the execution environment. Their results may be used immediately at run time, in which case they are considered a form of run time optimization [1]. They can also be used offline to provide hints to a subsequent compilation cycle, in which case they constitute a form of profile-guided compilation, a strategy that is common in general purpose compilers. For instance, in the context where a set of possible parallelizations have been provided by the compiler (see issue 1), a profiling component can easily be made responsible for testing some relevant condition at run time (e.g., that depends on input data) and for selecting the best between various versions of the code. Beyond such simple tasks, we expect that profiling will, at the beginning of the execution, have enough resources to conduct more elaborate analyzes. We believe that combining an "open" static analysis with an integrated profiling component is a promising approach, first because it may relieve the programmer of a large part of the tedious task of implementing the distribution of computations, and second to free the compiler of the obligation to choose between several optimizations in the absence of enough relevant data. The main open question here is to define precisely the respective roles of the compiler and the profiler, and also the amount and nature of information the former can transmit to the latter. <sup>&</sup>lt;sup>3</sup>http://www.khronos.org/opencl #### 3.3.2. Profiling and dynamic optimization In the context of dynamic optimization, that is, when the compiler's abilities have been exhausted, a profiler can still do useful work, provided some additional capabilities [1]. If it is able to instrument the code the way, e.g., a PIN-tool does [58], it has access to the whole program, including libraries (or, for example, the code of a low-level library called from a scripting language). This means that it has access to portions of the program that were not under the compiler's control. The profiler can then perform dynamic inter-procedural analyzes, for instance to compute dependencies to detect parallelism that wasn't apparent at compile time because of a function call in the body of a loop. More generally, if the profiler is able to reconstruct at run time some representation of the whole program, as in [77] for example, it is possible to let it search for any construct that can be optimized and/or parallelized in the context of the current execution. Several virtual machines, e.g., for Java or Microsoft CLR, have opened this way of optimizing programs, probably because virtual machines need to maintain an intermediate, structured representation of the running program. The possibility of running programs on architectures that include a large number of computing cores has given rise to new abstractions [75], [49], [30]. Transactional memories, for instance, aim at simplifying the management of conflicting concurrent accesses to a shared memory, a notoriously difficult problem [51]. However, the performance of a transaction-based application heavily depends on its dynamic behavior, and too many conflicting accesses and rollbacks, severely affect performance. We bet that the need for multicore specific programming tools will lead to other abstractions based on speculative execution. Because of the very nature of speculation, all these abstractions will require run time evaluation, and maybe correction, to avoid pathological cases. The profiler has a central role here, because it can be made responsible for diagnosing inefficient use of speculative execution, and for taking corrective action, which means that it has to be integrated to the execution environment. We also think that the large scope and almost infinite potential uses of a profiling component may well suggest new parallel program abstractions, specially targeted at run time evaluation and adaptation. #### 3.3.3. Run time program modeling When profiling goes beyond simple aggregation of counts, it can, for example, sample a program's behavior and split its execution into phases. These phases may help target a subsequent evaluation on a new architecture [69]. When profiling instruments the whole program to obtain a trace, *e.g.*, of memory accesses, it is possible to use this trace for: - simulation, e.g., by varying the parameters of the memory hierarchy, - for modeling, e.g., to reconstruct some specific model of the program [77], or to extract dynamic dependencies that help identifying parallel sections [65]. Handling such large execution traces, and especially compressing them, is a research topic by itself [31], [60]. Our contribution to this topic [7] is unusual in that the result of compression is a sequence of loop nests where memory accesses and loop bounds are affine functions of the enclosing loop indices. Modeling a trace this way leads to slightly better average compression rates compared to other, less expressive techniques. But more importantly, it has the advantage to provide a result in symbolic form, and this result can be further analyzed with techniques usually restricted to the static analysis of source code. We plan to apply, in the short term, similar techniques to the modeling of dynamic dependencies, so as to be able to automatically extract parallelism from program traces. This kind of analysis is representative of a new kind of tools than could be named "parallelization assistants" [55], [65]. Properties that can't be detected by the compiler but that appear to hold in one or several executions of a program can be submitted to the programmer, maybe along a suitable reformulation of its program using some class of abstraction, *e.g.*, compiler directives. The goal is to provide help and guidance in adapting source code, in the same way a classical profiling tool helps pinpoint performance bottlenecks. Control and data dependencies are fundamental to such a tool. An execution trace provides an observed reality; for example a trace of memory addresses. If the observed dynamic dependencies provide a set of constraints, they also suggest a complete family of potential correct executions, be they parallel or sequential, and all these executions are equivalent to the reference execution. Being able to handle large traces, and representing them in some manageable way, means being able to highlight medium to large grain parallelism, which is especially interesting on multicore architectures and often difficult for compilers to discover, for example because of the use of pointers and the difficulty of eliminating potential aliasing. This can be seen as a machine learning problem, where the goal is to recover a hidden structure from a large sequence of events. This general problem has various incarnations, depending on how much the learner knows about the original program, on the kind of data obtained by profiling, on the class of structures sought, and on the objectives of the analysis. We are convinced that such studies will enrich our understanding of the behavior of programs, and of the programming concepts that are really useful. It will also lead to useful tools, and will open up new directions for dynamic optimization. ## 3.4. Dynamic parallelization and optimization, virtual machine Participants: Philippe Clauss, Alain Ketterlin, Vincent Loechner, Benoît Pradelle, Alexandra Jimborean. This link in the programming chain has become essential with the advent of the new multicore architectures. Still being considered as secondary with mono-core architectures, dynamic analysis and optimization are now one of the keys for controling those new mechanisms complexity. From now on, performed instructions are not only dedicated to the application functionalities, but also to its control and its transformation, and so in its own interest. Behaving like a computer virus, such a process should rather be qualified as a "vitamin". It perfectly knows the current characteristics of the execution environment and owns some qualitative information thanks to a behavior modeling process (issue 2). It appends a significant part of optimizing ability compared to a static compiler, while observing live resources availability evolution. #### 3.4.1. State of the art Dynamic analysis and optimization, that is to say simultaneous to the program execution, have motivated a growing interest during the last decade, mainly because of the hardware architectures and applications growing complexity. Indeed, it has become more and more difficult to anticipate any program run simply from its source code, either because its control structures introduce some unknown objects before run (dynamic memory allocation, pointers, ...), or because the interaction between the target architecture and the program generates unpredictable behaviors. This is notably due to the appearance of more optimizing hardware units (prefetching units, speculative processing, code cache, branch prediction, etc.). With multicore architectures, this interest is growing even more. Works achieved in this area for mono-core processors have permitted to establish some classification of the so-called dynamic approaches, either based on the used methodologies or on the objectives. The first objective for any dynamic approach is to extract some live information at runtime relying on a profiling process. This essential step is the main objective of issue 2 (see sub-section 3.3). Identifying some "hotspots" thanks to profiling is then used for performance improvement optimizations. Two main approaches can be distinguished: - the *profile-guided* approach, where analysis and optimization of profile information are performed off-line, that is to say statically. A first run is only performed to extract information for driving a re-compilation. Related to this approach, *iterative compilation* consists in running a code that has been transformed following different optimization possibilities (nature and sequencing of the applied optimizations), and then in re-compiling the transformed code guided by the collected performance information, and so on until obtaining a "best" program version. In order to promote a rapid convergence towards a better solution, some heuristics or some machine learning mechanisms are used [22], [64], [63]. The main drawback of such approaches relates to the quality of the generated code which depends on the reference profiled execution, and more precisely on the used input data set, but also on the used hardware. - the *on-the-fly* approach consists in performing all steps at each run (profiling, analysis and transformation). The main constraint of this approach is that the time overhead has to be widely compensated by the benefits it generates. Several works propose such approaches dedicated to specific optimizations. We personally successfully implemented a dynamic data prefetching system for the Itanium processor [1]. Although all these works provided some efficient dynamic mechanisms, their adaptation to multicore architectures yields difficult issues, and even challenges them. It is indeed necessary to control interactions between simultaneous tasks, imposing an additional complexity level which can be fateful for a dynamic system, while becoming too costly in time and space. Some dynamic parallelizing techniques have been proposed in the last years. They are mainly focusing on parallelizing loop-nests, as programs generally spend most of their execution time in iterative structures. The LRPD test [67] is certainly one of the foundation strategies. This method consists in speculatively parallelizing loops. Privatization and reduction transformations are applied to promote a successful application of the strategy. During execution, some tests are performed to verify the speculation validity. In case of invalid speculation, the targeted loop is re-executed sequentially. However, the application range is limited to loops accessing arrays; pointers cannot be handled. Moreover the method is not fully dynamic since an initial static analysis is needed. In [34], Cintra and Llanos present a speculative parallel execution mechanism for loops, where iteration chunks are executed in sliding windows of n threads. The loops are not transformed and the sequential schedule remains as a reference to define a total order on the speculative threads. In order to verify whether some dependencies are violated during the program run, all data structures qualified as speculative, that is to say those being accessed in read-write mode by the threads, are duplicated for each thread and tagged following those states: not accessed, modified, exposed loaded or exposed loaded and later modified. For example, a read-after-write dependency has been violated if a thread owns a data tagged as exposed loaded or exposed loaded and modified, and if a predecessor thread, following the sequential total order, owns the same data but tagged as modified or exposed loaded and modified, while this data has not yet been committed in main memory. Such an approach can be memory-costly as each shared data structure is duplicated. It can be tricky to adjust verification frequencies to minimize time overhead. Some other methods based on the same principle of verifying speculation relatively to the sequential schedule have been proposed recently as in [71], where each iteration of a loop is decomposed into a prologue, a speculative body and an epilogue. The speculative bodies are performed in parallel and each body completion induces a verification. This approach seems to be only well suited for loops which bodies represent significant computation time. Another recent work is the development of SPICE [66] which is a speculative parallelizing system where an entire first run of a loop is initially observed. This observation serves in determining the values reached by some variables during the run. During a next run of the loop, several speculative threads are launched. They consider as initial values of some variables the values that have been observed at the previous run. If a thread reaches the starting value of another thread, it stops. Thus each thread performs a different portion of the loop. But if the loop behavior changes and if another thread starting value is never reached, the run goes on sequentially until completion. The main limits of these propositions are: - they do not alter the initial sequential schedule since always contiguous instruction blocks are speculatively parallelized; - their underlying parallelism is out of control: the characteristics of the generated parallel schedule are completely unknown since they randomly depend on the program instructions, their dependencies and the target machine. If bad performance is encountered, no other parallelization solution can be proposed. Moreover, the effective instruction schedule occurring at program run can significantly vary from one run to another, hence leading to a confusing performance inconsistency. A strategy that would uniquely be based on a transactional memory mechanism, with rollbacks in the case of data races, yields a totally uncontrolable parallelism where performance can not be ensured and not even strongly expected. While being based on efficient prediction mechanisms, a better control over parallelization will permit to provide solutions that are well suited to a varying execution context and to parallelize portions of code that can be parallelized only in some particular context. It is indeed crucial to maximize the potential parallelism of the applications to take advantage of the forthcoming processors comprising several tens of cores. ## 3.4.2. General objective: building a virtual machine As it has already been mentioned, dynamic parallelization and optimization can take place inside a virtual machine. All the research objectives that are presented in the following are related to its construction. Notice that the term of "virtual machine" is employed to group a set of dynamic analysis and optimization mechanisms taking as input a binary code, eventually enriched with specific instructions. We refer to a process virtual machine which main role is dynamic binary optimization from one instruction set to the same instruction set. The taxonomy given in [70] includes this kind of virtual machine. Notice that this virtual machine can run in parallel on the processor cores during the four initial phases (see figure 2), but also simultaneously to the target application, either by sharing some cores with light processes, or by using cores that are useless for the target application. It will also support a transactional memory mechanism, if available. However the foreseen parallelizing strategies do not depend on such a mechanism since our speculative executions are supposed to be as reliable as possible thanks to efficient prediction models, and since they are supported by a specific and higher level rollback mechanism. Anyway if available, a transactional memory mechanism would allow to take advantage of "nearly perfect" prediction models. The virtual machine takes as input an intermediate code expressing several kinds of parallelism on several code extracts. Those kinds of parallelism are either effective, that is to say that the corresponding parallel execution is obviously semantically correct, or hypothetical, that is to say that there is still some uncertainty on the parallelism correctness. In this case, this uncertainty will have to be resolved at run time. This intermediate "multi-parallel" code is generated by the static parallelization described subsection 3.2. It also contains generic descriptions of parallelizing or optimizing transformations which parameters will have to be instanciated by the virtual machine, thanks to its knowledge about the target architecture and the program run-time behavior. #### 3.4.3. Adaptation of the intermediate code to the target architecture The virtual machine first phase is to adapt this intermediate code to the target multicore architecture. It consists in answering the following questions: - What is the suitable kind of parallelism? - What is the suitable parallel task granularity? - What is the suitable number of parallel tasks? - Can we take advantage of a specialized instruction set for some operations? - What are the parameter values for some parallelization or optimization? The multi-parallel intermediate code exhibits different parameters allowing to adapt some parallelizing and optimizing transformations to the target architecture. For example, a loop unrolling will be parametrized by the number of iterations to be unrolled. This number will depend, for example, on the number of available registers and the size of the instruction cache. A parallelizing transformation will depend on several possible parallel instruction schedules. One or several schedules will be selected, for example, depending on the kind of memory hierarchy and the cache sharing among cores. Concerning hypothetical parallelism, this first phase will reduce the number of these propositions to solutions that are well suited to the target architecture. This phase also instruments the intermediate code in order to install the dynamic mechanisms related to profiling and speculative parallel execution. Figure 2. The virtual machine #### 3.4.4. High level parallelization and native code creation From these target architecture related adaptations, a parallel intermediate code is generated. It contains instructions that are specific to the dynamic optimizing and parallelizing mechanisms, *i.e.*, instrumentation instructions to feed the profiling process as well as calls to speculative execution management procedures. A translation into native code executable by the target processor follows. This translation also allows to keep trace of the code extracts that have to be modified during the run. #### 3.4.5. Low level parallelization The binary version of the code exhibits new parallelism and optimization sources that are specific to the instruction set and to the target architecture capabilities. Moreover, some dynamic optimizations are dedicated to specific instructions, or instruction blocks, as for example the memory reads which time performances can be dynamically improved by data prefetching [1]. Thus the binary code can be transformed and instrumented as well. #### 3.4.6. Distribution, execution and profiling The so built executable code is then distributed among the processor cores to be run. During the run, the instrumentation instructions feed the profiler with information for execution monitoring and for behavior models construction (see subsection 3.3). An accurate knowledge of the binary code, thanks to the control of its generation, also permits at this step to dynamically control the insertion or deletion of some instrumentation instructions. Indeed it is important to manage execution monitoring through sampling based instrumentations in varying frequencies, following the changing behavior frequency (see in [1] and [76] a description of this kind of mechanism), as such instrumentations necessarily induce overheads that have to be minimized. ## 3.4.7. Re-parallelization, thread mutation or rollback Depending on the information collected from instrumentation, and depending on the built prediction models, the profiling phase causes a re-transformation of some code parts, thus causing the mutation of the concerned threads. Such re-transformation is done either on the binary code whether it consists in low level and small modifications, as for example the adjustement of a data prefetching distance, or on the intermediate code if it consists in a complete modification of the parallelizing strategy. For example, such a processing will follow the observation of a bad performance, or of a change in the computing resources availability, or will be caused by the completion of a dependency prediction model allowing the generation of a speculative parallelization. From such a speculative execution, a re-transformation can consist in rolling back to a sequential execution version when the considered hypothetical parallelism, and thus the associated prediction model, has been evaluated wrong. ## 3.5. Proof of program transformations for multicores Participants: Éric Violard, Julien Narboux, Nicolas Magaud, Vincent Loechner, Alexandra Jimborean. ## 3.5.1. State of the art #### 3.5.1.1. Certification of low-level codes. Among the languages allowing to exploit the power of multicore architectures, some of them supply the programmer a library of functions that corresponds more or less to the features of the target architecture: for example, CUDA<sup>4</sup> for the architectures of type GPGPU and more recently the standard OpenCL<sup>5</sup> that offers a unifying programming interface allowing the use of most of the existing multicore architectures or a use of heterogeneous aggregate of such architectures. The main advantage of OpenCL is that it allows the programmer to write a code that is portable on a large set of architectures (in the same spirit as the MPI library for multi-processor architectures). However, at this low level, the programming model is very close to the executing model, the control of parallelism is explicit. Proof of program correctness has to take into account low-level mechanisms such as hardware interruptions or thread preemption, which is difficult. In [39], Feng *et al.* propose a logic inspired from the Hoare logic in order to certify such low-level programs with hardware interrupts and preempted threads. The authors specify this logic by using the meta-logic implemented in the Coq proof assistant [25]. ## 3.5.1.2. Certification of a compiler. The problem here is to prove that transformations or optimizations preserve the operational behaviour of the compiled programs. Xavier Leroy in [28], [53] formalizes the analyses and optimizations performed by a C compiler: a big part of this compiler is written in the specification language of Coq and the executable (Caml) code of this compiler is obtained by automatic extraction from the specification. Optimizing compilers are complex softwares, particularly in the case of multi-threaded programs. They apply some subtle code transformations. Therefore some errors in the compiler may occur and the compiler may produce incorrect executable codes. Work is to be done to remedy this problem. The technique of validation *a posteriori* [72], [73] is an interesting alternative to full verification of a compiler. #### 3.5.1.3. Semantics of directives. As it was mentioned in subsection 3.2.3, the use of directives is an interesting approach to adapt languages to multicore architectures. It is a syntactic means to tackle the increasing need of enriching the operational semantics of programs. Ideally, these directives are only comments: they do not alter the correction of programs and they are a good means to improve their performance. They allow the separation of concerns: *correction* and *efficiency*. However, using directives in that sense and in the context of automatic parallelization, raises some questions: for example, assuming that directives are not mandatory, how to ensure that directives are really taken into account? How to know if a directive is better than another? What is the impact of a directive on performance? <sup>&</sup>lt;sup>4</sup>http://www.nvidia.com/object/cuda\_what\_is.html <sup>5</sup>http://www.khronos.org/opencl In his thesis [41], that was supervised by Éric Violard, Philippe Gerner addresses similar questionings and states a formal framework in which the semantics of compilation directives can be defined. In this framework, any directive is encoded into one equation which is added to an algebraic specification. The semantics of the directives can be precisely defined via an order relation (called relation of *preference*) on the models of this specification. #### 3.5.1.4. Definition of a parallel programming model. Classically, the good definition of a programming model is based on a semantic domain and on the definition of a "toy" language associated with a proof system, which allows to prove the correctness of the programs written in that language. Examples of such "toy" languages are CSP for control parallelism and $\mathcal{L}$ [29] for data parallelism. The proof systems associated with these two languages, are extensions of the Hoare logic. We have done some significant works on the definition of data parallelism [11]. In particular, a crucial problem for the good definition of this programming model, is the semantics of the various syntactic constructs for data locality. We proposed a semantic domain which unifies two concepts: *alignment* (in a data-parallel language like HPF) and *shape* (in the data-parallel extensions of C). We defined a "toy" language, called PEI, that is made of a small number of syntactic constructs. One of them, called *change of basis*, allows the programmer to exhibit parallelism in the same way as a placement or a scheduling directive [42]. #### 3.5.1.5. Programming models for multicore architectures. The multicore emergence questions the existing parallel programming models. For example, with the programming model supported by OpenMP, it is difficult to master both correctness and efficiency of programs. Indeed, this model does not allow programmers to take optimal advantage of the memory hierarchy and some OpenMP directives may induce unpredictable performances or incorrect results. Nowadays, some new programming models are experienced to help at designing both efficient and correct programs for multicores. Because memory is shared by the cores and its hierarchy has some distributed parts, some works aim at defining a hybrid model, between task parallelism and data parallelism. For example, languages like UPC (Unified Parallel C)<sup>6</sup> or Chapel<sup>7</sup> combine the advantages of several programming paradigms. In particular, the model of memory transactions (or transactional memory [50]) retains much attention since it offers the programmer a simple operational semantics including a mutual exclusion mechanism which simplifies program design. However, much work remains to define the precise operational meaning of transactions and the interaction with the other languages features [59]. Moreover, this model leaves the compiler a lot of work to reach a safe and efficient execution on the target architecture. In particular, it is necessary to control the atomicity of transactions [40] and to prove that code transformations preserve the operational semantics. #### 3.5.1.6. Refinement of programs. Refinement [23], [43] is a classical approach for gradually building correct programs: it consists in transforming an initial specification by successive steps, by verifying that each transformation preserves the correctness of the previous specification. Its basic principle is to derive simultaneously a program and its own proof. It defines a formal framework in which some rules and strategies can be elaborated to transform specifications written by using the same formalism. Such a set of rules is called a *refinement calculus*. Unity [33] and Gamma [24] are classical examples of such formalisms, but they are not especially designed for refining programs for multicore architectures. Each of these formalisms is associated with a computing model and thus each specification can be viewed as a program. Starting with an initial specification, a proof logic allows a user to derive a specification which is more suited to the target architecture. <sup>&</sup>lt;sup>6</sup>http://upc.gwu.edu <sup>&</sup>lt;sup>7</sup>http://chapel.cs.washington.edu Refinement applies for the programming of a large range of problems and architectures. It allows to pass the limitations of the polyhedral model and of automatic parallelization. We designed a refinement calculus to build data parallel programs [74]. #### 3.5.2. Main objective: formal proof of analyses and transformations Our main objective consists in certifying the critical modules of our optimization tools (the compiler and the virtual machine). First we will prove the main loop transformation algorithms which constitute the core of our system. The optimization process can be separated into two stages: the transformations consisting in optimizing the sequential code and in exhibiting parallelism, and those consisting in optimizing the parallel code itself. The first category of optimizations can be proved within a sequential semantics. For the other optimizations, we need to work within a concurrent semantics. We expect the first stage of optimizations to produce data-race free code. For the second stage of optimizations, we will first assume that the input code is data-race free. We will prove those transformations using Appel's concurrent separation logic [45]. Proving transformations involving program which are not data-race free will constitute a longer term research goal. ## 3.5.3. Proof of transformations in the polyhedral model The main code transformations used in the compiler and the virtual machine are those carried out in the polyhedral model [52], [38]. We will use the Coq proof assistant to formalize proofs of analyses and transformations based on the polyhedral model. In [32], Cachera and Pichardie formalized nested loops in Coq and showed how to prove *properties* of those loops. Our aim is slightly different as we plan to prove *transformations* of nested loops in the polyhedral model. We will first prove the simplest unimodular transformations, and later we will focus on more complex transformations which are specific to multicore architectures. We will first study scheduling optimizations and then optimizations improving data locality. #### 3.5.4. Validation under hypothesis In order to prove the correction of a code transformation T it is possible to: - prove that T is correct in general, i.e., prove that for all x, T(x) is equivalent to x. - prove a posteriori that the applied transformation has been correct in the particular case of a code c. The second approach relies on the definition of a program called *validator* which verifies if two pieces of program are equivalent. This program can be modeled as a function V such that, given two programs $c_1$ and $c_2$ , $V(c_1, c_2) = true$ only if $c_1$ has the same semantics as $c_2$ . This approach has been used in the field of optimizations certification [62], [61]. If the validator itself contains a bug then the certification process is broken. But if the validator is proved formally (as it was achieved by Tristan and Leroy for the Compcert compiler [72], [73]) then we get a transformed program which can be trusted in the same way as if the transformation is proved formally. This second approach can be used only for the *effective parallelism*, when the static analysis provides enough information to parallelize the code. For the *hypothetical parallelism*, the necessary hypotheses have to be verified at run time. For instance, the absence of aliases in a piece of code is difficult to decide statically but can be more easily decided at run time. In this framework, we plan to build a *validator under hypotheses*: a function V' such that, given two programs $c_1$ and $c_2$ and an hypothesis H, if $V'(c_1, c_2, H) = true$ , then H implies that $c_1$ has the same semantics as $c_2$ . The validity of the hypothesis H will be verified dynamically by the virtual machine. This verification process, which is part of the virtual machine, will have to be proved as correct as well. ## 3.5.5. Rejecting incorrect parallelizations The goal of the project is to exhibit potential parallelism. The source code can contain many sub-routines which could be parallelized under some hypothesis that the static analysis fails to decide. For those optimizations, the virtual machine will have to verify the hypotheses dynamically. Dynamically dealing with the potential parallelism can be complex and costly (profiling, speculative execution with rollbacks). To reduce the overhead of the virtual machine, we will have to provide efficient methods to rule out quickly incorrect parallelism. In this context, we will provide hypotheses which are easy to check dynamically and which can tell when a transformation cannot be applied, *i.e.*, hypotheses which are sufficient conditions for the non-validity of an optimization. ## 4. Application Domains ## 4.1. Application Domains Performance being our main objective, our developments' target applications are characterized by intensive computation phases. Such applications are numerous in the domains of scientific computations, optimization, data mining and multimedia. Applications involving intensive computations are necessarily high energy consumers. However this consumption can be significantly reduced thanks to optimization and parallelization. Although this issue is not our prior objective, we can expect some positive effects for the following reasons: - Program parallelization tries to distribute the workload equally among the cores. Thus an equivalent performance, or even a better performance, to a sequential higher frequency execution on one single core, can be obtained. - Memory and memory accesses are high energy consumers. Lowering the memory consumption, lowering the number of memory accesses and maximizing the number of accesses in the low levels of the memory hierarchy (registers, cache memories) have a positive consequence on execution speed, but also on energy consumption. ## 5. Software ## 5.1. PolyLib PolyLib <sup>8</sup> is a C library of polyhedral functions, that can manipulate unions of rational polyhedra of any dimension, through the following operations: intersection, difference, union, convex hull, simplify, image and preimage. It was the first to provide an implementation of the computation of parametric vertices of a parametric polyhedron, and the computation of an Ehrhart polynomial (expressing the number of integer points contained in a parametric polytope) based on an interpolation method. It is used by an important community of researchers (in France and the rest of the world) in the area of compilation and optimization using the polyhedral model. Vincent Loechner is the maintainer of this software. It is distributed under GNU General Public License version 3 or later, and it has a Debian package maintained by Serge Guelton (Symbiose Projet, IRISA). ## 5.2. ZPolyTrans ZPolyTrans $^9$ is a C library and a set of executables, that permits to compute the integer transformation of a union of parametric $\mathbb{Z}$ -polyhedra (the intersection between lattices and parametric polyhedra), as a union of parametric $\mathbb{Z}$ -polyhedra. The number of integer points of the result can also be computed. It is build upon PolyLib and Barvinok library. This work is based on some theoretical results obtained by Rachid Seghir and Vincent Loechner, that will be published in ACM TACO in 2011 [13]. <sup>&</sup>lt;sup>8</sup>http://icps.u-strasbg.fr/PolyLib <sup>&</sup>lt;sup>9</sup>http://ZPolyTrans.gforge.inria.fr It allows for example to compute the number of solutions of a Presburger formula by eliminating existancial integer variables, or to compute the number of different data accessed by some array accesses contained in an affine parametric loop nest. The authors of this software are Rachid Seghir (Univ. Batna, Algeria) and Vincent Loechner. It is distributed under GNU General Public License version 3 or later. #### 5.3. NLR We have developed a program implementing our loop-nest recognition algorithm, detailed in [7]. This standalone, filter-like application takes as input a raw trace and builds a sequence of loop nests that, when executed, reproduce the trace. It is also able to predict forthcoming values at an arbitrary distance in the future. Its simple, text-based input format makes it applicable to all kinds of data. These data can take the form of simple numeric values, or have more elaborate structure, and can include symbols. The program is written is standard ANSI C. The code can also be used as a library. We have used this code to evaluate the compression potential of loop nest recognition on memory address traces, with very good results. We have also shown that the predictive power of our model is competitive with other models on average. The software is available upon request to anybody interested in trying to apply loop nest recognition. It has been distributed to a dozen of colleagues around the world. We plan on using this software as the base for a new tool we currently design, for the analysis of parallel traces. ## 5.4. Dynamic version selector We are developing a toolchain to automatically select between different versions of parallel loop nests, as described in subsection 6.2. It generates the profiling code and selection code from a loop nest source code and different schedules, expressed in the CLooG format. Benoît Pradelle (PhD) wrote this toolchain, based on python scripts. It is not yet distributed. ## 5.5. Binary files decompiler Our research on efficient memory profiling has lead us to develop a sophisticated decompiler. This tool analyzes x86-64 binary programs and libraries, and extracts various structured representations of the code. It works on a routine per routine basis, and first builds a loop hierarchy to characterize the overall structure of the algorithm. It then puts the code into Static Single Assignment (SSA) form to highlight the fine-grain data-flow between registers and memory. Building on these, it performs the following analyzes: - All memory addresses are expressed as symbolic expressions involving specific versions of register contents, as well as loop counters. Loop counter definitions are recovered by resolving linearly incremented registers and memory cells, i.e., registers that act as induction variables. - Most conditional branches are also expressed symbolically (with registers, memory contents, and loop counters). This captures the control-flow of the program, but also helps in defining what amounts to loop "trip-counts", even though our model is slightly more general, because it can represent any kind of iterative structure. This tool embodies several passes that, as far as we know, do not exist in any existing similar tool. For instance, it is able to track data-flow through stack slots in most cases. It has been specially designed to extract a representation that can be useful in looking for parallel (or parallelizable) loops [48]. It is the basis of several of our studies. Because binary program decompilation is especially useful to reduce the cost of memory profiling, our current implementation is based on the Pin binary instrumenter. It uses Pin's API to analyze binary code, and directly interfaces with the upper layers we have developed (e.g., program skeletonization, or minimal profiling). However, we have been careful to clearly decouple the various layers, and to not use any specific mechanism in designing the binary analysis component. Therefore, we believe that it could be ported with minimal effort, by using a binary file format extractor and a suitable binary code parser. It is also designed to abstract away the detailed instruction set, and should be easy to port (even though we have no practical experience in doing so). We feel that such a tool could be useful to other researchers, because it makes binary code available under abstractions that have been traditionally available for source code only. If sufficient interest emerges, e.g., from the embedded systems community, or from researchers working on WCET, or from teams working on software security, we are willing to distribute and/or to help make it available under other environments. ## 5.6. Dynamic dependency analyser We have recently started developing a dynamic dependence analyzer. Such a tool consumes the trace of memory (or object) accesses, and uses the program structure to list all the data dependences appearing during execution. Data dependences in turn are central to the search for parallel sections of code, with the search for parallel loops being only a particular case of the general problem. Most current works of these questions are either specific to a particular analysis (e.g., computing dependence densities to select code portions for thread-level speculation), or restricted to particular forms of parallelism (e.g., typically to fully parallel loops). Our tool tries to generalize existing approaches, and focuses on the program structures to provide helpful feedback either to a user (as some kind of "smart profiler"), or to a compiler (for feedback-directed compilation). For example, the tool is able to produce a dependence schema for a complete loop nest (instead of just a loop). It also targets irregular parallelism, for example analyzing a loop execution to estimate the expected gain of parallelization strategies like inspector-executor. We have developed this tool in relation to our minimal profiling research project. However, the tool itself has been kept independent of our profiling infrastructure, getting data from it via a well-defined trace format. This intentional design decision has been motivated by our work on distinct execution environments: first on our usual x86-64 benchmark programs, and second on less regular, more often written in Java, real-world applications. The latter type of applications is likely the one that will most benefit from such tools, because their intrinsic execution environment does not offer enough structure to allow effective static analysis techniques. Parallelization efforts in this context will most likely rely on code annotations, or specific programming language constructs. Programmers will therefore need tools to help them choose between various constructs. Our tool has this ambition. We already have a working tool-chain for C/C++/Fortran programs (or any binary program). We are in the process of developing the necessary infrastructure to connect the dynamic dependence profiler to instrumented Java programs. Other managed execution environments could be targeted as well, e.g., Microsoft's .Net architecture, but we have no time and/or workforce to devote to such time-consuming engineering efforts. #### 5.7. VMAD software and LLVM For dynamic analysis and optimization of programs, we are developing a virtual machine called VMAD, and specific passes to the LLVM compiler suite, plus a modified Clang frontend. It is fully described in subsection 6.1. We implemented for now a memory access predictor in loop nests, based on the computation of linear interpolation functions. The profiling is very fast compared to other existing tools, as it samples only the first few iterations of each loop in the nest, then it is deactivated to return to the original, faster version. Other tools like PIN or PEBIL do not support such activation/deactivation mechanism. New annotations for the final user, taken as input by LLVM, and new VMAD modules will be developed, as these tools have been designed to be very evolving. Alexandra Jimborean (PhD), Matthieu Herrmann (Master student) and Luis Mastrangelo (Master student) are the main contributors of this software. It is not yet distributed. ## 5.8. Polyhedral prover Participants: Nicolas Magaud, Julien Narboux, Éric Violard [correspondant]. We are currently developing a formal proof of program transformations based on the polyhedral model. We use the CompCert verified compiler [54] as a framework. This tool is written in the specification language of Coq. It is connected to the activity described in section 6.6. ## 6. New Results ## 6.1. VMAD and LLVM The goal is to provide a set of annotations (pragmas) that the user can insert in the source code to perform low level analyses (profiling) or optimizations (dynamic parallelization for example). We are developing a virtual machine handling advanced dynamic analyses and transformations of programs. VMAD is organized as a sequence of basic operations, where external modules associated to specific strategies are dynamically loaded when required. The program binary files handled by VMAD are previously instrumented at compile time to include necessary data, instrumentation instructions and callbacks to the virtual machine. Dynamic information, such as memory locations of launched modules, are patched at startup in the binary file. The LLVM compiler has been extended to automatically instrument programs to meet the requirements both of VMAD and of the handled/chosen analysis and transformation strategies. VMAD uses sampling and multi-versioning to limit the instrumentations time overhead. At runtime, targeted codes are launched by successive chunks that can be either original, instrumented or optimized/parallelized versions. After each chunk execution, decisions can be taken relatively to the current optimization strategy. At this time, VMAD is handling advanced memory access profiling through linear interpolation of the addresses, dynamic dependency analysis and version selction. The last developments are focusing on speculative polyhedral parallelization. The profiling strategy interpolating the memory addresses accessed in a loop nest has been run on some of the SPEC2006 and Pointer Intensive benchmark suites, showing a very low time overhead, in most cases. More details are available in the research reports [46] and [47], and publications [15] and [16]. ## **6.2. Dynamic version selector** Adaptive version selection between different parallel versions of code is necessary when the execution context of a program is not known. The execution contexts includes all or some of these possibly variable parameters: the target architecture, the load of the computer at execution time, and the input data. We have developed a framework handling loops in the polyhedral model, that is able to take a runtime decision about which version to execute. It is based on: - the generation of different code versions of a loop nest; - an install-time profiling to take into account the architecture parameters, that builds a parametric ranking table between the versions; - a runtime selection, predicting the load balance and the execution time of each code version, before executing the best one. We showed that different versions of a code are required on several polyhedral loop nest benchmarks, depending on both the target architecture and the input data. And we showed speedups compared to any statically chosen version in all execution contexts. More details are available in publication [19]. ## 6.3. Binary parallelization Our work on parallelizing binary programs has continued in 2011, with several new results. The general principle is to analyze the binary code and extract a model of the most intense loops. The model has to include everything that is related to memory access, and also some part of the computations done in registers. Once a suitable model is extracted, it can be used to derive a new scheduling for each targeted loops, optimizing various criteria: this is where polyhedral techniques are used, providing algorithms to optimize locality, parallelism, or both at the same time. After a new scheduling is computed, the transformed code is generated by a polyhedron-scanning algorithm. Our approach relies on an intermediate representation whose emphasis is on memory accesses, hiding, i.e., *outlining* all low-level details and retaining only what is needed by the parallelization component: we use raw C constructs, and macros to denote outlined code. Starting with the executable program, a first phase raises the code into our intermediate representation. The second phase uses a stock parallelizing component, producing a transformed C programs. The last phase lowers this intermediate representation into a new binary executable. The system then uses a run-time monitoring component (generated automatically at the same time as the parallel version) that redirects execution to the transformed loops whenever appropriate. This year's activity on this topic in our team has started by finalizing and presenting a paper at the IMPACT workshop [20], held during CGO'2011, in Chamonix (France). This workshop focuses on tools and techniques based on the polyhedral model. It has been interesting to hear the various reactions and remarks of researchers attending our presentation: the general position is that our work opens new perspectives on the use of the polyhedral model, and avoids to need to have a complete polyhedral tool-chain. Another major aspect is the fact that our 3-phase strategy clearly separates the polyhedral part of the whole process, in essence providing a basis for a polyhedral programming language that is slightly more general that what was considered before. The work on this topic has continued on two directions. The first was to effectively abstract the parallelizing phase. This has been done by demonstrating the use of two distinct parallelizers: the first is PLUTO, a polyhedral locality optimizer and parallelizer, and the second is CETUS, a "simple" parallelizer. The second new direction was directed by the complexity of typical "real-world" executable programs. It has consisted in developing new dependence analysis and parallelization techniques, handling more general classes of programs. This current is currently submitted for publication. These research results will be presented at the forthcoming *HiPEAC* conference, to be held in Paris in January 2012. A full-length paper has been accepted for publication in *ACM Transactions on Architecture and Code Optimization* some time in 2012. ## 6.4. Modeling the dynamic behavior of executable programs Modeling the dynamic behavior of a given program is useful for several reasons. First, it is a particular way of profiling the program, targeting non trivial characteristics of the execution. As such, it helps programmers understand the behavior of the program, and hopefully helps them optimizing it. Second, the results can be used in a compiler, using run-time information to drive static optimizations. This path has seen considerable development when it comes, for example, to sequence basic blocks so as to leverage branch predictors and/or optimize the usage of instruction caches. We ambition to take it one step further, using run-time information to help a compiler in the task of auto-parallelization. Third, modeling can be used on-line, during the execution of the program, to drive the use of dynamic optimizations. Our first achievement in 2011 has been the presentation of our paper at the International Symposium on Performance Analysis of Software and Systems [17]. The paper describes an approach that e have called *program skeletonization*. The basic idea is to perform a static analysis of the code under scrutiny, and locate a small number of register assignments that completely determine the set of memory addresses the program will access. By instrumenting these elementary value assignments and extracting the ensuing computations of addresses, the amount of instrumentation can be dramatically reduced, at the cost of offloading some computations to the profiler. On average, this provides a significant gain in the time needed to obtain a memory trace, and is independent on the particular application, e.g., cache simulation, data race detection, and so on. Our second main research direction on this topic has been the characterization of semi-regular memory accesses. Semi-regular accesses are caused by the traversal of a data structure linking successive memory cells in no particular memory order, i.e., a linked list or a tree. We have developed a modeling algorithm that is able to detect that a set of instructions perform irregular accesses that are actually highly correlated, differing only by an affine function of the enclosing loop indices. This has several implications in terms of potential optimizations. First it exhibits a kind of abstract iterator, that can later be handled, e.g., by inspector/executor techniques. Second, it reduces the number of potential dependencies that have to be tested. The third, most recent, research project that we have started this year is the analysis of traces of parallel programs, currently MPI programs. Parallel traces offer new research challenges, because they contain events that are only partially ordered. We have extended our loop nest recognition algorithm [7] to handle parallel traces. Our preliminary results show that this algorithm is highly effective in extracting communication patterns. This has a number of potential applications that we plan to study in the coming months. ## 6.5. Dynamic dependence analysis We have started a research project on dynamic dependence analysis. The principle is to observe an execution of a given program and collect dependence information. This form of profiling is similar to memory profiling, except that to goal is to directly produce data dependencies. During collection, data dependencies are abstracted into dependence graphs which, in turn, give enough information to decide whether a given code portion is parallel and, if the answer is positive, precisely constrain the set of applicable program transformations. Our implementation currently uses our own profiling infrastructure to obtain data from a running program, and models the sequence of run-time data dependencies in a unique framework. The system is of course sensitive to the fact that the input data is representative of typical inputs: the conclusion it draws cannot be applied without resorting to speculation. Currently, we restrict the system to a *parallelization assistant*, whose result is a set of suggestion that the programmer is free to follow or ignore. A fundamental characteristic of a parallelization assistant is the class of parallel constructions that it is able to extract from run-time data. We have designed the profiling component and the analysis algorithm such as to be able to represent the dependencies with various abstractions, from simple boolean dependencies to full dependence polyhedra. However, even the simplest dependence model provides useful hints to parallelize a given program. In this case, our tool flags every loop in the program as either intrinsically sequential, or potentially parallel. To make this widely usable, we have decided to target common parallel programing constructions, namely OpenMP directives. In turns out that our design is flexible enough to let the system also target loops that become parallel after simple privatization transformations. The result of profiling is thus a set of complete OpenMP directives, making the loop parallel but also instructing the compiler to take special measure for local data that (falsely) renders loop iterations dependent. Because the general design of our dynamic dependence analysis system is generic and modular, we plan to develop and distribute software implementing our approach. This project, called Parwiz, has received support from INRIA under the form of an ADT (*Action de Développement Technologique*). The funding should let us hire an engineer for two years. Unfortunately, we have not been able to find a suitable candidate in 2011, and plan to continue the recruitment process in 2012. In the meantime, we have supervised en Argentinian student, José Cacherosky, during 4 months (from July to October 2011), as an intern with funding from the INRIA *International Internships* programs. The goal of the internship was to extend our framework (developed mainly to work from binary programs) to another execution environment, namely the *Java Virtual Machine*. Most of the time has been spent on developing an instrumentation infrastructure, but José Cacherosky has also started the implementation of some of the parallelism detection algorithms. We plan to continue this work soon, and provide a tool that could work in several, very distinct environments. Finally, Fabrice Rastello, research scientist at the *École Normale Supérieure de Lyon* (with the INRIA team COMPSYS), has expressed interest in collaborating with us on this research project. ## 6.6. Dealing with arithmetic overflows in the polyhedral model Participants: Nicolas Magaud, Julien Narboux, Éric Violard. Our goal in collaboration with Alexandre Pilkiewicz, PhD student, and François Pottier, senior researcher at INRIA, is to prove formally the correctness of a compiler based on the polyhedral model and to integrate it in the Compcert compiler. But as the polyhedral transformations apply to affine loop nests in a mathematical framework where each loop variable is considered to be a mathematical integer, and not a machine integer, we must therefore warrant that no arithmetic overflow occurs when the considered loop nests are executed. We proposed a solution to produce a compiler which does not ignore the problem of overflows. Our solution consists in generating a formula which captures the presence of overflows in the program, then asking to an external tool (i.e. the iscc calculator), for a sufficient condition about the parameters which implies the absence of overflows. Finally we check this condition dynamically. If the condition holds we can use the optimized version of the program. If it does not, in order to preserve the semantics of the program we keep the original version. Figure 3 illustrates our solution for overcoming the problem of arithmetic overflows and for ensuring the correctness of polyhedral transformations. In addition to the polyhedral optimizer, our compiler uses an oracle and a validator. The oracle returns a boolean expression (b) which denotes a sufficient condition to ensure that both the original program (org) and the optimized program (opt) do not produce any overflow. Our transformation then builds a program that we call the resulting program, of the shape If b then opt else org. It dynamically evaluates the boolean expression b and executes org, i.e. the original program, if the condition is not fulfilled or opt, i.e. the optimized program, if the condition is true. The resulting program is then transmitted to the validator. The validator is a function which takes the original program (org), the optimized program (opt) and the resulting program, and returns a boolean: if it returns true, then the resulting program is equivalent to the original one and our compiler therefore produces the resulting program. We now have to formally prove the validator using the Coq proof assistant. Figure 3. A solution for overcoming the problem of arithmetic overflows ## 7. Partnerships and Cooperations ## 7.1. National Initiatives Philippe Clauss, Alain Ketterlin and Vincent Loechner are involved in the proposition of an INRIA Large Wingspan Project (*Action d'Envergure Nationale*) entitled "Software for multicores and hardware accelerators" and regrouping several french teams doing researches in compilers, parallel computing and program optimization. Philippe Clauss shares the head of the project with Gilles Muller of the INRIA REGAL team. A new version of the project will be submitted to INRIA at the end of 2011. ## 7.2. International Initiatives #### 7.2.1. INRIA Associate Teams 7.2.1.1. ANCOME Title: Memory and applications memory behavior INRIA principal investigator: Philippe Clauss **International Partner:** Institution: Universidad de Buenos Aires (Argentina) Laboratory: Departamento de Computacion, Facultad de Ciencias Exactas y Naturales Duration: 2011 - 2013 See also: http://lafhis.dc.uba.ar/wiki/index.php/EA-Ancome This associate team focuses on developing original methods for the analysis of programs memory behavior, in particular in the context of applications using dynamic memory allocation. The proposed approaches consists of analyzing and modeling the runtime behavior, where extracted properties are then verified thanks to static analysis processes. Thus pure static approaches limits are overpassed. Further, the case of multi-threaded applications run on multi-core architectures is studied in order to elaborate and extend our analysis techniques and to extract properties specific to this context. The issues are mainly concerned with the conception of real-time applications using dynamic memory allocation. #### 7.2.2. INRIA International Partners Rachid Seghir, researcher and teacher at the University of Batna, Algeria, works in close collaboration with Vincent Loechner. He is the co-author of the ZPolyTrans software and of a forthcoming paper in the journal ACM Transactions on Architecture and Code Optimization [13]. Benoît Meister, Managing Engineer at Reservoir Labs, New York, USA, has collaborated with Vincent Loechner and Rachid Seghir on a forthcoming publication [13]. Jean Christophe Beyler, Senior HPC Engineer at Intel and in the International Exascale project, University of Versailles, France, is experimenting energy saving strategies using the VMAD framework of the CAMUS team. Sven Verdoolaege, researcher at the Leiden Institute of Advanced Computer Science, Leiden, The Netherlands, has collaborated with Philippe Clauss, Vincent Loechner and Diego Garbervetsky in the writing of a chapter of the book entitled *Energy-Aware Memory Management for Embedded Multimedia Systems: A Computer-Aided Approach* [21]. ## 7.2.3. Visits of International Scientists #### 7.2.3.1. Visits Sergio Yovine (from Jun 6 2011 until Jun 12 2011) Institution: Universidad de Buenos Aires (Argentina), EA INRIA Ancome Diego Garbervetsky (from Apr 1 2011 until Apr 10 2011 and from Aug 1 2011 until Sept 2 2011) Institution: Universidad de Buenos Aires (Argentina), EA INRIA Ancome #### 7.2.3.2. Internships Luis Mastrangelo (from Mar 2011 until Aug 2011) Subject: A Virtual Machine for Automatic Program Parallelization Institution: Universidad de Buenos Aires (Argentina) Bruno Cuervo Parrino (from May 2011 until Oct 2011) Subject: Formalizing a new validation mechasnism under assumptions for speculative parallelism Institution: Universidad de Buenos Aires (Argentina) Jose Cacherosky (from Jul 2011 until Dec 2011) Subject: Dynamic dependence profiling for Java Institution: Universidad de Buenos Aires (Argentina) #### 7.2.4. Participation In International Programs The CAMUS team is associated to the CNRS-CONICET Associated International Laboratory France-Argentina *INFINIS* <sup>10</sup> (INformatique Fondamentale, logIque, laNgages, vérIfication et Systèmes) inaugurated in December 2011. ## 8. Dissemination ## 8.1. Animation of the scientific community Vincent Loechner is one of the organizers of the *Second International Workshop on Polyhedral Compilation Techniques*, *IMPACT 2012* <sup>11</sup>, that will be held in January 2012 in Paris, in conjunction with the international conference HiPEAC. He also participates, with Philippe Clauss, to its program committee. Alexandra Jimborean presented her work about code tracking and cloning in the LLVM compiler at the *First LLVM European User Group Meeting* <sup>12</sup> in London, organized and sponsored by ARM Ltd., Sept. 16 2011. Philippe Clauss has reviewed papers for the international conference DATE 2012, for the international workshop IMPACT 2011 and the french journal TSI. He has also reviewed projects for the ANR, and the Ile-de-France and Aquitaine Regions. <sup>10</sup>https://dri-dae.cnrs-dir.fr/spip.php?article3009 <sup>11</sup> http://impact.gforge.inria.fr/impact2012 <sup>12</sup> http://llvm.org/devmtg/2011-09-16 Philippe Clauss participated to the following PhD jurys in 2011: | Date | Candidate | Place | Role | |----------|-------------------|------------------|----------| | Feb. 14 | Khamar SAJJAD | Univ. Versailles | Rewiever | | Jul. 4 | Konrad TRIFUNOVIC | Univ. Paris-Sud | Rewiever | | Sept. 30 | Antoniu POP | ENSM Paris | Rewiever | | Nov. 10 | Khaled HAMIDOUCHE | Univ. Paris-Sud | Rewiever | Julien Narboux participated to the PhD jury of Tuan Minh Pham (INRIA Sophia-Antipolis) in 2011 as an examiner. ## 8.2. Teaching • Philippe Clauss, Université de Strasbourg, France | Intitulé du cours | Nombre d'heures | Niveau | |----------------------------------------|-----------------|----------| | Architecture des ordinateurs | 27 | L2 | | Compilation | 64 | M1 | | Compilation avancée | 30 | M1 | | Système et programmation temps-réel | 25 | M1 | | OS embarqués et réseaux industriels et | 35 | M2 | | de terrain | | | | Parallélisme | 15 | M2 | | 1 Encadrement TER | 10 | M1 | | 1 Encadrement stage Master | 50 | M2 | | 2 Encadrements doctorat | 200 | Doctorat | • Vincent Loechner, Université de Strasbourg, France | Intitulé du cours | Nombre d'heures (éq.TD) | Niveau | |--------------------------------------|-------------------------|-------------------| | Structures de données et algorithmes | 15 | L2 | | Programmation système et réseau | 52 | L2 | | Fondements des systèmes | 38 | L3 | | d'exploitation | | | | Langage interprété | 40 | M1 | | Compilation avancée | 8 | M1 | | Parallélisme | 15 | M2 | | Parallélisme | 30 | école d'ingénieur | | Encadrement TER (1) | 10 | M1 | | Encadrement stage Master (1) | 50 | M2 | | Encadrement doctorat (1) | 100 | Doctorat | • Alain Ketterlin, Université de Strasbourg, France | Intitulé du cours | Nombre d'heures | Niveau | |-----------------------------|-----------------|--------| | Bases de données avancées | 40 | M2 | | Ingénierie de la preuve | 18 | M1 | | Programmation fonctionnelle | 22 | L2 | | Encadrement d'apprentis | 20 | M1 | • Julien Narboux, Université de Strasbourg, France | Intitulé du cours | Nombre d'heures (éq. TD) | Niveau | |---------------------------------------|--------------------------|--------| | Logique et Programmation Logique | 61 | L2 | | Logique | 34 | la | | Modèles de Calcul | 40 | L1 | | Techniques de Programmation | 45 | L2 | | Ingénierie de la preuve | 27 | M1 | | Certification du Logiciel | 36 | M2 | | Constructions et Preuves en Géométrie | 6 | M2 | | 1 Encadrement Projet 150h | 10 | M1 | | Encadrement stage ENSIIE | 30 | la 1a | | Encadrement stage Master | 50 | M2 | • Éric Violard, Université de Strasbourg, France | Intitulé du cours | Nombre d'heures (éq. TD) | Niveau | |--------------------------------|--------------------------|--------| | Programmation Fonctionnelle | 61 | L2 | | Programmation Système & Réseau | 36 | L2 | | Compilation | 42 | M1 | | Sémantique | 45 | M1 | • Nicolas Magaud, Université de Strasbourg, France | - Tricolas triagada, Oniversite de Strasbourg, France | | | | |-------------------------------------------------------|--------------------------|-----------------------|--| | 192 heures statutaires en L2 informatique, | Nombre d'heures (éq. TD) | Niveau | | | L3 math, Master Informatique Intitulé du | | | | | cours | | | | | Projet Professionnel Personalisé | 20 | L2 | | | Structures de Données - Algorithmes | 33 | L2 | | | Ingénierie de la preuve | 12 | M1 | | | Preuve et Certification du Logiciel | 36 | M2 | | | Algorithmique et Programmation | 46 | ENSIIE-Strasbourg 1A | | | Algorithmique et Structures de données | 30 | L3 Math | | | Génie Logiciel | 30 | M1 Chemo-informatique | | #### PhD: PhD: Benoît Pradelle, *Static and Dynamic Polyhedral Compilation Methods for the Execution in Multi-core Environments* [12], Université de Strasbourg, December the 20th 2011, Philippe Clauss PhD in progress: Alexandra Jimborean, *Advanced parallelization tools for multicore programming*, October 2009, Philippe Clauss and Vincent Loechner PhD in progress: Jean-François Dollinger, *Dynamic optimizations for GPUs*, September 2011, Philippe Clauss and Vincent Loechner ## 9. Bibliography ## Major publications by the team in recent years - [1] J. C. BEYLER, P. CLAUSS. *Performance driven data cache prefetching in a dynamic software optimization system*, in "ICS '07: Proceedings of the 21st annual international conference on Supercomputing", New York, NY, USA, ACM, 2007, p. 202–209, http://doi.acm.org/10.1145/1274971.1275000. - [2] J. C. BEYLER, M. KLEMM, P. CLAUSS, M. PHILIPPSEN. *A meta-predictor framework for prefetching in object-based DSMs*, in "Concurr. Comput.: Pract. Exper.", September 2009, vol. 21, p. 1789–1803, http://dx.doi.org/10.1002/cpe.v21:14. [3] P. CLAUSS. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs, in "ICS '96: Proceedings of the 10th international conference on Supercomputing", New York, NY, USA, ACM, 1996, p. 278–285, http://doi.acm.org/10.1145/237578.237617. - [4] P. CLAUSS, F. J. FERNÁNDEZ, D. GARBERVETSKY, S. VERDOOLAEGE. Symbolic polynomial maximization over convex sets and its application to memory requirement estimation, in "IEEE Transactions on Very Large Scale Integration (VLSI) Systems", Aug 2009, vol. 17, no 8, p. 983-996, http://hal.inria.fr/inria-00504617. - [5] P. CLAUSS, V. LOECHNER. *Parametric Analysis of Polyhedral Iteration Spaces*, in "J. VLSI Signal Process. Syst.", 1998, vol. 19, n<sup>o</sup> 2, p. 179–194, http://dx.doi.org/10.1023/A:1008069920230. - [6] P. CLAUSS, I. TCHOUPAEVA. A Symbolic Approach to Bernstein Expansion for Program Analysis and Optimization, LNCS, Springer, April 2004, vol. 2985, p. 120-133. - [7] A. KETTERLIN, P. CLAUSS. *Prediction and trace compression of data access addresses through nested loop recognition*, in "6th annual IEEE/ACM international symposium on Code generation and optimization", États-Unis Boston, ACM, April 2008, p. 94-103, http://dx.doi.org/10.1145/1356058.1356071, http://hal.inria.fr/inria-00504597/en. - [8] V. LOECHNER, B. MEISTER, P. CLAUSS. *Precise data locality optimization of nested loops*, in "Journal of Supercomputing", January 2002, vol. 21, n<sup>o</sup> 1, p. 37–76, Kluwer Academic Pub.. - [9] V. LOECHNER, D. K. WILDE. *Parameterized Polyhedra and their Vertices*, in "International Journal of Parallel Programming", December 1997, vol. 25, n<sup>o</sup> 6. - [10] S. VERDOOLAEGE, R. SEGHIR, K. BEYLS, V. LOECHNER, M. BRUYNOOGHE. Counting Integer Points in Parametric Polytopes Using Barvinok's Rational Functions, in "Algorithmica", 2007, vol. 48, no 1, p. 37–66, http://dx.doi.org/10.1007/s00453-006-1231-0. - [11] É. VIOLARD. A Semantic Framework to Address Data Locality in Data Parallel Languages, in "Parallel Computing", 2004, vol. 30, n<sup>o</sup> 1, p. 139-161. ## **Publications of the year** #### **Doctoral Dissertations and Habilitation Theses** [12] B. PRADELLE. Static and Dynamic Polyhedral Compilation Methods for the Execution in Multi-core Environments, Université de Strasbourg, December 2011. #### **Articles in International Peer-Reviewed Journal** [13] R. SEGHIR, V. LOECHNER, B. MEISTER. *Integer Affine Transformations of Parametric Z-polytopes and Applications to Loop Nest Optimization*, in "ACM Transactions on Architecture and Code Optimization", 2011, to appear in 2011., http://hal.inria.fr/inria-00582388/en. ## **International Conferences with Proceedings** [14] J.-D. GENEVAUX, J. NARBOUX, P. SCHRECK. Formalization of Wu's simple method in Coq, in "CPP 2011 First International Conference on Certified Programs and Proofs", Kenting, Taiwan, Province Of China, J.- - P. JOUANNAUD, Z. SHAO (editors), Lecture Notes in Computer Science, Springer-Verlag, December 2011, http://hal.inria.fr/inria-00618745/en. - [15] A. JIMBOREAN, M. HERRMANN, V. LOECHNER, P. CLAUSS. *VMAD: a Virtual Machine for Advanced Dynamic Analysis of Programs*, in "IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS", Austin, United States, IEEE, April 2011, http://hal.inria.fr/inria-00544501/en. - [16] A. JIMBOREAN, V. LOECHNER, P. CLAUSS. *Handling Multi-Versioning in LLVM: Code Tracking and Cloning*, in "WIR 2011: Workshop on Intermediate Representations, in conjunction with CGO 2011", Chamonix, France, Florent Bouchez, Sebastian Hack, Eelco Visser, April 2011, http://hal.inria.fr/inria-00572785/en. - [17] A. KETTERLIN, P. CLAUSS. *Efficient Memory Tracing by Program Skeletonization*, in "IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS", Austin, United States, IEEE, April 2011, http://hal.inria.fr/inria-00544497/en. - [18] T. M. PHAM, Y. BERTOT, J. NARBOUX. A Coq-based Library for Interactive and Automated Theorem Proving in Plane Geometry, in "The 11th International Conference on Computational Science and Its Applications (ICCSA 2011)", Santander, Spain, Lecture Notes in Computer Science, Springer-Verlag, 2011, vol. 6785, p. 368-383 [DOI: 10.1007/978-3-642-21898-9\_32], http://hal.inria.fr/inria-00584918/en. - [19] B. PRADELLE, P. CLAUSS, V. LOECHNER. *Adaptive Runtime Selection of Parallel Schedules in the Polytope Model*, in "High Performance Computing Symposium", Boston, États-Unis, ACM/SIGSIM, April 2011, http://hal.inria.fr/inria-00564311/en/. - [20] B. PRADELLE, A. KETTERLIN, P. CLAUSS. Transparent Parallelization of Binary Code, in "First International Workshop on Polyhedral Compilation Techniques, IMPACT 2011, in conjunction with CGO 2011", Chamonix, France, Christophe Alias, Cédric Bastoul, April 2011, http://hal.inria.fr/inria-00572797/en. #### Scientific Books (or Scientific Book chapters) [21] P. CLAUSS, D. GARBERVETSKY, V. LOECHNER, S. VERDOOLAEGE. *Polyhedral Techniques for Parametric Memory Requirement Estimation*, in "Energy-Aware Memory Management for Embedded Multimedia Systems: A Computer-Aided Approach", F. BALASA, D. PRADHAN (editors), Taylor & Francis, 2011. #### References in notes - [22] F. AGAKOV, E. BONILLA, J. CAVAZOS, B. FRANKE, G. FURSIN, M. F. P. O'BOYLE, J. THOMSON, M. TOUSSAINT, C. K. I. WILLIAMS. *Using Machine Learning to Focus Iterative Optimization*, in "CGO '06: Proceedings of the International Symposium on Code Generation and Optimization", Washington, DC, USA, IEEE Computer Society, 2006, p. 295–305, http://dx.doi.org/10.1109/CGO.2006.37. - [23] R. BACK. On the Correctness of Refinement Steps in Program Development, University of Helsinki, 1978. - [24] J.-P. BANÂTRE, D. LE MÉTAYER. *The* GAMMA *Model and its Discipline of Programming*, in "Science of Computer Programming", 1990, vol. 15, n<sup>o</sup> 1, p. 55-79. [25] B. Barras, S. Boutin, C. Cornes, J. Courant, J.-C. Filliatre, E. Gimenez, H. Herbelin, G. Huet, C. Munoz, C. Murthy, C. Parent, C. Paulin-Mohring, A. Saibi, B. Werner. *The Coq Proof Assistant Reference Manual: Version 6.1*, 1997. - [26] M. M. BASKARAN, U. BONDHUGULA, S. KRISHNAMOORTHY, J. RAMANUJAM, A. ROUNTEV, P. SA-DAYAPPAN. A compiler framework for optimization of affine loop nests for GPGPUs, in "ICS '08: Proceedings of the 22nd annual international conference on Supercomputing", New York, NY, USA, ACM, 2008, p. 225–234, http://doi.acm.org/10.1145/1375527.1375562. - [27] C. BASTOUL. *Code Generation in the Polyhedral Model Is Easier Than You Think*, in "PACT'13 IEEE International Conference on Parallel Architecture and Compilation Techniques", Juan-les-Pins, France, 2004, p. 7–16, http://hal.ccsd.cnrs.fr/ccsd-00017260. - [28] Y. BERTOT, B. GRÉGOIRE, X. LEROY. A Structured Approach to Proving Compiler Optimizations Based on Dataflow Analysis, in "TYPES 2004", 2004, p. 66-81. - [29] L. BOUGÉ, Y. LE GUYADEC, G. UTARD, B. VIROT. A Proof System for a Simple Data-Parallel Programming Language, in "IFIP WG 10.3, Applications in Parallel and Distributed Computing", Caracas (Venezuela), North-Holland, April 1994. - [30] M. BRIDGES, N. VACHHARAJANI, Y. ZHANG, T. JABLIN, D. I. AUGUST. Revisiting the Sequential Programming Model for Multi-Core, in "MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture", Washington, DC, USA, IEEE Computer Society, 2007, p. 69–84, http://dx.doi.org/10.1109/MICRO.2007.35. - [31] M. BURTSCHER, I. GANUSOV, S. J. JACKSON, J. KE, P. RATANAWORABHAN, N. B. SAM. *The VPC Trace-Compression Algorithms*, in "IEEE Trans. Comput.", 2005, vol. 54, n<sup>o</sup> 11, p. 1329–1344. - [32] D. CACHERA, D. PICHARDIE. *Embedding of Systems of Affine Recurrence Equations in Coq*, in "Proc. of 16th International Conference on Theorem Proving in Higher Order Logics (TPHOLs'03)", Lecture Notes in Computer Science, Springer-Verlag, 2003, no 2758, p. 155–170. - [33] K. CHANDY, J. MISRA. Parallel Program Design: A Foundation, Addison Wesley, 1988. - [34] M. CINTRA, D. R. LLANOS. *Design Space Exploration of a Software Speculative Parallelization Scheme*, in "IEEE Trans. Parallel Distrib. Syst.", 2005, vol. 16, n<sup>o</sup> 6, p. 562–576, http://dx.doi.org/10.1109/tpds.2005.69. - [35] P. FEAUTRIER. *Dataflow analysis of scalar and array references*, in "International Journal of Parallel Programming", 1991, vol. 20, n<sup>o</sup> 1, p. 23–53. - [36] P. FEAUTRIER. Some efficient solutions to the affine scheduling problem, Part 1: one dimensional time, in "International Journal of Parallel Programming", 1992, vol. 21, n<sup>o</sup> 5, p. 313–348. - [37] P. FEAUTRIER. Some efficient solutions to the affine scheduling problem, Part 2: multidimensional time, in "International Journal of Parallel Programming", 1992, vol. 21, n<sup>o</sup> 6. - [38] P. FEAUTRIER. *Automatic Parallelization in the Polytope Model*, in "The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications", Springer-Verlag, 1996, p. 79–103. - [39] X. Feng, Z. Shao, Y. Dong, Y. Guo. *Certifying low-level programs with hardware interrupts and preemptive threads*, in "SIGPLAN Not.", 2008, vol. 43, n<sup>o</sup> 6, p. 170–182, http://dx.doi.org/10.1145/1379022. 1375603. - [40] C. FLANAGAN, S. N. FREUND, J. YI. Velodrome: a sound and complete dynamic atomicity checker for multithreaded programs, in "PLDI '08: Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation", New York, NY, USA, ACM, 2008, p. 293–303, http://dx.doi.org/10. 1145/1375581.1375618. - [41] P. GERNER. La sémantique des directives au compilateur : application au parallélisme de données, Université Louis Pasteur, 2002. - [42] P. GERNER, É. VIOLARD. A Theoretical Framework of Data Parallelism and Its Operational Semantics, in "Euro-Par 2000", LNCS, Springer, 2001, vol. 1900, p. 668–677. - [43] E. P. GRIBOMONT. Stepwise refinement and concurrency: the finite-state case, in "Sci. Comput. Program.", 1990, vol. 14, no 2-3, p. 185–228, http://dx.doi.org/10.1016/0167-6423(90)90020-E. - [44] M. HALL, D. PADUA, K. PINGALI. *Compiler research: the next 50 years*, in "Commun. ACM", 2009, vol. 52, no 2, p. 60–67, http://doi.acm.org/10.1145/1461928.1461946. - [45] A. HOBOR, A. W. APPEL, F. Z. NARDELLI. *Oracle Semantics for Concurrent Separation Logic*, in "ESOP", 2008, p. 353-367. - [46] A. JIMBOREAN, M. HERRMANN, V. LOECHNER, P. CLAUSS. A Static-Dynamic Collaborative Framework for Nested Loops Instrumentation and Profiling, Université de Strasbourg, 05 2010, http://hal.inria.fr/inria-00534745/en/. - [47] A. JIMBOREAN, M. HERRMANN, V. LOECHNER, P. CLAUSS. VMAD: a Virtual Machine for Advanced Dynamic Analysis of Programs, Université de Strasbourg, 09 2010, http://hal.inria.fr/inria-00534748/en/. - [48] A. KETTERLIN, P. CLAUSS. *Recovering the Memory Behavior of Executable Programs*, in "10th IEEE Working Conference on Source Code Analysis and Manipulation, SCAM", Roumanie Timisoara, IEEE Computer Society Press, Sep 2010, http://hal.inria.fr/inria-00502813. - [49] M. KULKARNI, K. PINGALI, B. WALTER, G. RAMANARAYANAN, K. BALA, L. P. CHEW. *Optimistic parallelism requires abstractions*, in "PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation", New York, NY, USA, ACM, 2007, p. 211–222, http://doi.acm.org/10.1145/1250734.1250759. - [50] J. LARUS, C. KOZYRAKIS. Transactional memory, in "Commun. ACM", 2008, vol. 51, no 7, p. 80–88. - [51] E. A. LEE. *The Problem with Threads*, in "Computer", 2006, vol. 39, n<sup>o</sup> 5, p. 33–42, http://dx.doi.org/10. 1109/MC.2006.180. - [52] C. LENGAUER. Loop Parallelization in the Polytope Model, in "Parallel Processing Letters", 1994, vol. 4, n<sup>o</sup> [53] X. LEROY. Formal verification of a realistic compiler, in "Communications of the ACM", July 2009, To appear. - [54] X. LEROY. The Compcert verified compiler, software and commented proof, January 2010, http://compcert.inria.fr. - [55] S.-W. LIAO, A. DIWAN, R. P. BOSCH, A. GHULOUM, M. S. LAM. SUIF Explorer: an interactive and interprocedural parallelizer, in "PPoPP '99: Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming", New York, NY, USA, ACM, 1999, p. 37–48, http://doi.acm.org/10.1145/301104.301108. - [56] V. LOECHNER, B. MEISTER, P. CLAUSS. *Data Sequence Locality: a Generalization of Temporal Locality*, in "Euro-Par 2001", Manchester, UK, Springer, 2001. - [57] V. LOECHNER, C. MONGENET. Communication Optimization for Affine Recurrence Equations using Broadcast and Locality, in "International Journal of Parallel Programming", 2000, vol. 28, n<sup>o</sup> 1. - [58] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, K. Hazelwood. *Pin: building customized program analysis tools with dynamic instrumentation*, in "PLDI '05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation", New York, NY, USA, ACM, 2005, p. 190–200, http://doi.acm.org/10.1145/1065010.1065034. - [59] K. F. MOORE, D. GROSSMAN. *High-level small-step operational semantics for transactions*, in "POPL '08: Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages", New York, NY, USA, ACM, 2008, p. 51–62, http://dx.doi.org/10.1145/1328438.1328448. - [60] T. MOSELEY, D. A. CONNORS, D. GRUNWALD, R. PERI. *Identifying potential parallelism via loop-centric profiling*, in "CF '07: Proceedings of the 4th international conference on Computing frontiers", ACM, 2007, p. 143–152. - [61] G. C. NECULA. *Translation validation for an optimizing compiler*, in "SIGPLAN Not.", 2000, vol. 35, n<sup>o</sup> 5, p. 83–94, http://doi.acm.org/10.1145/358438.349314. - [62] A. PNUELI, O. SHTRICHMAN, M. SIEGEL. *The Code Validation Tool (CVT) Automatic verification of code generated from synchronous languages*, in "Software Tools for Technology Transfer", 1998, vol. 2. - [63] L.-N. POUCHET, C. BASTOUL, A. COHEN, J. CAVAZOS. *Iterative optimization in the polyhedral model: part II, multidimensional time*, in "PLDI '08: Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation", New York, NY, USA, ACM, 2008, p. 90–100, http://doi.acm.org/10.1145/1375581.1375594. - [64] L.-N. POUCHET, C. BASTOUL, A. COHEN, N. VASILACHE. Iterative Optimization in the Polyhedral Model: part I, One-Dimensional Time, in "CGO '07: Proceedings of the International Symposium on Code Generation and Optimization", Washington, DC, USA, IEEE Computer Society, 2007, p. 144–156, http://dx.doi.org/10.1109/CGO.2007.21. - [65] G. D. PRICE, J. GIACOMONI, M. VACHHARAJANI. Visualizing potential parallelism in sequential programs, in "PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques", New York, NY, USA, ACM, 2008, p. 82–90, http://doi.acm.org/10.1145/1454115.1454129. - [66] E. RAMAN, N. VACHHARAJANI, R. RANGAN, D. I. AUGUST. Spice: speculative parallel iteration chunk execution, in "CGO '08: Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization", New York, NY, USA, ACM, 2008, p. 175–184, http://doi.acm.org/10.1145/ 1356058.1356082. - [67] L. RAUCHWERGER, D. PADUA. The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization, in "IEEE Trans. Parallel Distrib. Syst.", 1999, vol. 10, n<sup>o</sup> 2, p. 160–180, http://dx.doi.org/10.1109/71.752782. - [68] R. SEGHIR. Méthodes de dénombrement de points entiers de polyèdres et applications à l'optimisation de programmes, Université de Strasbourg, December 2006. - [69] T. SHERWOOD, E. PERELMAN, G. HAMERLY, B. CALDER. Automatically characterizing large scale program behavior, in "ASPLOS-X: Proceedings of the 10th international conference on Architectural support for programming languages and operating systems", New York, NY, USA, ACM, 2002, p. 45–57, http://doi. acm.org/10.1145/605397.605403. - [70] J. SMITH, R. NAIR. Virtual Machines: Versatile Platforms for Systems and Processes (The Morgan Kaufmann Series in Computer Architecture and Design), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005. - [71] C. TIAN, M. FENG, V. NAGARAJAN, R. GUPTA. *Copy or Discard Execution Model For Speculative Parallelization On Multicores*, in "IEEE/ACM 41st International Symposium on Microarchitecture, MICRO 41", November 2008, p. 330-341. - [72] J.-B. TRISTAN, X. LEROY. Formal verification of translation validators: a case study on instruction scheduling optimizations, in "SIGPLAN Not.", 2008, vol. 43, n<sup>o</sup> 1, p. 17–27, http://dx.doi.org/10.1145/1328897.1328444. - [73] J.-B. TRISTAN, X. LEROY. *Verified Validation of Lazy Code Motion*, in "Programming Language Design and Implementation 2009", ACM Press, 2009, To appear. - [74] É. VIOLARD, S. GENAUD, G.-R. PERRIN. *Refinement of Data Parallel Programs in PEI*, in "Proceedings of the IFIP TC 2 WG 2.1 international workshop on Algorithmic languages and calculi", London, UK, UK, Chapman & Hall, Ltd., 1997, p. 107–131. - [75] A. WELC, S. JAGANNATHAN, A. HOSKING. *Safe futures for Java*, in "OOPSLA '05: Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications", New York, NY, USA, ACM, 2005, p. 439–453, http://doi.acm.org/10.1145/1094811.1094845. - [76] Q. Wu, O. Mencer. *Evaluating Sampling Based Hotspot Detection*, in "International Conference on Architecture of Computing Systems, ARCS", March 2009. - [77] B. XIN, W. N. SUMNER, X. ZHANG. *Efficient program execution indexing*, in "PLDI '08: Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation", New York, NY, USA, ACM, 2008, p. 238–248, http://doi.acm.org/10.1145/1375581.1375611.