Section: Overall Objectives

Introduction

Summary

For several decades, it was possible to translate transistor size reductions into faster processors. Some additional architectural complexity was involved at each generation, but essentially, programs that ran on a given processor would run faster on the next-generation processor without modification. Due to power constraints, this evolution has slowed down considerably, leading to multi-core architectures and raising severe programming (parallelization) challenges. Again due to power constraints, even the multi-core option is now being challenged. At the same time, equally severe transistor fault issues are coming into play, calling for defect-tolerant architectures.

While the processor and multi-core options should still be pushed and optimized as much as possible, it is now also time to contemplate alternative computing system designs that are more easily compatible with the power and defect issues of ultra-CMOS technology, as well as alternative silicon technologies or even non-silicon technologies. Companies can hardly afford to explore risky alternative paths, while academics can play an important pioneering and filtering role. Moreover, such computing approaches may induce changes in architecture and programming profound enough that they should be investigated and anticipated far ahead.

We outline a roadmap that follows, adapts and attempts to take advantage of current and upcoming technology options. For power reasons, we believe computing systems will have to shift to customization, making heterogeneous systems (composed of a mix of cores and accelerators) first-class citizens, including for general-purpose computing. As a first step, we feel it is necessary to craft the hardware template and programming environment for such systems. We then want to focus on the design of accelerators; we contemplate fast methods for generating accelerators, as well as configurable and versatile accelerators (capable of targeting multiple algorithms). We especially seek accelerators which can tolerate defects without even having to identify and disable the faulty parts. That leads us to neural network accelerators, which have inherent robustness properties; another possible contender, which will also be investigated later on, is accelerators based on probabilistic logic. After the initial hype, neural networks fell out of favor for many years, but a remarkable convergence of recent trends and innovations makes them very interesting accelerator candidates again, with a broad application scope (in 2005, the shift to RMS applications, which rely on statistical and machine-learning algorithms), state-of-the-art algorithmic properties (in 2006, the advent of Deep Networks) and emerging technologies almost ideally suited to their hardware implementation (in 2008, the first implementation of a memristor device). We first explore robust digital CMOS implementations, then shift our focus to the more efficient and more inherently defect-tolerant analog CMOS implementations. Beyond CMOS, the memristor is a prime contender among emerging technologies for implementing neural network-based accelerators (and configurable accelerators as well).
Beyond silicon, we also investigate other technologies, including hybrid silicon-biological implementations, and leverage recent progress in neurobiology, which will allow us to expand the application scope of these accelerators.

Overall, our driving goal is to design computing systems that are low-power and defect-tolerant, that have a broad application scope, and that can scale across many of the upcoming and emerging technologies.

Context

The observation at the root of this project is that we are entering a new era where technology constraints and/or novel technologies are already forcing us to consider possibly drastically different computing systems. The well-known Moore's law, stating that the size of transistors may be halved every two years or so, has held for almost four decades. Like any exponential law, it is bound to stop; however, many researchers have incorrectly predicted its end over the past few decades, and we will cautiously avoid betting against the ingenuity of physicists. Still, while transistor size is regularly decreasing, we are now beyond the point where this regular decrease can be smoothly translated into processor performance improvements, because of a set of technology-related constraints.

Power issues. Reducing transistor size has two main benefits: it increases transistor density (and thus chip capacity at constant cost), and it decreases the switching time of transistors (enabling higher clock frequencies). For several decades, processor manufacturers essentially leveraged the second property to increase processor performance, mostly using the additional transistors to implement the mechanisms necessary to feed very fast cores with instructions and data quickly enough. The first major roadblock was hit in 2005, when Intel announced it could no longer count on transistor size reduction to increase clock frequencies because of power dissipation constraints [10]. Each time a transistor switches, it dissipates a tiny amount of power; as the number of switches per second increases, so does the power dissipated. It is possible to compensate for the power increase due to higher switching frequencies (and thus higher clock frequencies) by scaling down the voltage. However, as transistor size and voltage are scaled down, another source of power dissipation, leakage power, increases and ultimately cancels out the benefit of voltage scaling. As a result, there is a crossing point, which we have now reached, where voltage scaling is no longer useful for reducing total power, and consequently we can no longer afford to let transistors switch at maximum frequency. So, while transistor size can still be reduced, it is no longer possible to take advantage of the faster switching time due to excessive power dissipation. As a result, most processor manufacturers decided to leverage only the increased transistor density, not the faster transistor switching time, and turned to multi-core processors instead of large processors with high clock frequencies.
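
The voltage-scaling trade-off described above can be sketched with a toy first-order model. All constants below are illustrative assumptions, not measurements: dynamic power follows the classical C·V²·f relation, and leakage is modeled as growing exponentially as the threshold voltage (kept at a fixed overdrive below the supply V to preserve switching speed) is scaled down along with V.

```python
import math

def dynamic_power(c, v, f):
    """Switching power: capacitance * supply voltage^2 * frequency."""
    return c * v * v * f

def leakage_power(v, i0=1.0, overdrive=0.2, k=0.05):
    """Toy leakage model (assumed constants): leakage current rises
    exponentially as the threshold voltage is scaled down."""
    vt = v - overdrive  # threshold voltage assumed to track the supply
    return v * i0 * math.exp(-vt / k)

def total_power(c, v, f):
    return dynamic_power(c, v, f) + leakage_power(v)

# Lowering V first reduces total power, then leakage dominates;
# with these assumed constants the minimum sits near V = 0.25 V,
# i.e., the "crossing point" beyond which voltage scaling stops helping.
curve = {v: total_power(1e-9, v, 2e9) for v in (0.5, 0.25, 0.15)}
```

The shape of the curve, not its absolute values, is the point: below the crossing point, further voltage scaling increases total power instead of reducing it.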

Because voltage is no longer scaled down and because the power requirement of a transistor does not decrease proportionally with its size, as we keep reducing transistor size, the total power requirement of the same chip area (with its increasing number of transistors) actually increases. Because the power budget of a chip is limited, we are now facing a situation where it will not be possible to activate all transistors at the same time. These unused transistors have been coined Dark Silicon by ARM [17], the leading embedded processor designer (whose cores are used in virtually all cell phones and many embedded systems). They constitute a second major roadblock that we are fast approaching. This roadblock may simply void the notion of many-cores: while transistor density allows for a large number of cores, it will no longer be possible to activate all cores at the same time.

Programming issues. In any case, the multi-core option was already severely challenged by programming issues: after several decades of research, there is still no easy solution for quickly and efficiently parallelizing a broad set of programs on a large number of cores.

Faults issues. While power issues are threatening the development of conventional architectures (fast cores, then multi-cores), it is still possible to scale down transistor size. However, as transistors get smaller, transistor faults are likely to raise even more challenges in the near future. With increasingly small transistors, it is no longer possible to ensure that all transistors and lines have the same characteristics. Such variations induce different latencies or different power characteristics in hardware components. Manufacturers have to compensate for these variations either by over-designing or by introducing targeted optimizations, which further complicates designs.

More importantly, transistors are becoming so small that they are susceptible to transient faults and permanent defects. Transient faults can result in bit flips, where memory bits are inverted, for instance simply due to cosmic rays. Permanent defects can occur either at manufacturing time, or during the lifetime of the chip due to electromigration (the slow migration of material, i.e., chip aging, resulting in shorts and opens).

Overall, computing systems based on processors and multi-cores are severely challenged by the evolution of technology, which is no longer the smooth evolution enjoyed over the past four decades. Considering that the regular improvement of processor performance has been a driver for a whole segment of the economy, this issue has consequences well beyond computing systems.

Approach and Roadmap

Why seek alternatives to processors?

Because academia and industry are hugely familiar with and experienced in processor and multi-core design, one can expect that significant progress can still be made by engineering solutions around power, defect and programming issues, and no one should discount how far this path can go. However, the pressure from technology is becoming so strong that seeking alternative paths should no longer be discounted either. Moreover, there are simple common-sense arguments which further motivate the exploration of alternative paths. As mentioned before, transistor size reduction is bound to stop at some point. When that happens, or when the aforementioned constraints become too severe, computing systems will only keep improving by resorting to different technologies and/or different computing principles. The industry cannot afford to explore such risky paths; it is rather the role of academia to take such risks and filter out the most promising paths. Moreover, the transformations induced by alternative technologies or computing principles will require a long time to mature enough to transfer to industry, so such research should be anticipated well ahead.

Naturally, there are many different possible alternative paths to be explored. In the paragraphs below, we outline the roadmap we intend to pursue, and the rationale for it.

Customization for low-power

Processors, also called general-purpose processors, are flexible architectures: they can execute any algorithm. But this flexibility comes at a hefty power price: to perform a simple arithmetic or logic operation, a processor has to perform multiple steps, involving multiple power-hungry hardware blocks, in addition to the operation itself. In specialized circuits, also called accelerators or ASICs, there is no such overhead: only the operation is performed. Moreover, the size of a specialized circuit is tailored to the task, which can drastically reduce power costs. Overall, custom circuits can perform the same task as processors with one or several orders of magnitude less power, at the same, or sometimes better, performance. The caveat is naturally flexibility: by definition, a specialized circuit can perform only a single algorithm.

However, if it were possible to cram many different accelerators onto one chip, then enough algorithms could be accelerated that most programs would benefit. Transistor density keeps increasing, but we know that only a fraction of the transistors can be used simultaneously, i.e., the so-called dark silicon. Trading cores for accelerator logic is therefore a convenient compromise. Unlike multi-cores, which need to leverage many cores to speed up the execution of an algorithm, usually only one or a few accelerators are used at a time. Therefore, not all transistors need to be switched on simultaneously, which makes accelerators compatible with the notion of dark silicon.

Note that customization is not a new approach in any way. It has been commonplace in Systems-on-Chips (SoCs) in embedded systems for decades. What we are suggesting is to reconcile customization and flexibility and use accelerators for general-purpose computing by cramming enough and sufficiently flexible accelerators on the same chip.

The exact form of accelerators is open for debate. They can be multiple specialized circuits, configurable logic, or intermediate solutions such as versatile accelerators (useful for multiple, but not all, algorithms). In the case of specialized or configurable circuits, we need to find ways to streamline their design, by automatically generating circuits or automatically mapping high-level code onto reconfigurable circuits. Also, cores will always be needed for easily performing simple control tasks or for tasks not covered by accelerators, so we are contemplating heterogeneous systems composed of a mix of accelerators and cores, much like SoCs.

However, the key difference, and the key pitfall we need to avoid, is the lack of programmability, which currently hinders both multi-cores and SoCs; SoCs are notoriously difficult to program, especially when partitioning tasks among cores and accelerators. But progress in software engineering, especially component-based programming, offers an elegant solution: for completely different reasons (programming productivity and managing large code bases), software engineering practices have encouraged developers to decompose programs into strictly independent components. Each component tends to be a self-contained algorithm, and moreover, such programming practices encourage the reuse of algorithms across programs. Now, a component can either be executed as a software component on a programmable core, or replaced by a call to a hardware accelerator. As a result, a program decomposed into components (a painless task compared to program parallelization) could almost transparently take advantage of an architecture containing accelerators. This approach would simultaneously address the power, performance and programmability issues of multi-cores.
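
The transparent software-or-hardware mapping described above can be sketched as follows. All names here (the registry, the component names) are hypothetical illustrations, not part of any proposed API: the caller invokes a component by name and does not know whether a software implementation on a core or an accelerator stub serves the call.

```python
from typing import Callable, Dict, List

class ComponentRegistry:
    """Maps component names to implementations. Registering an
    accelerator-backed implementation under the same name transparently
    replaces the software fallback, without touching the caller."""
    def __init__(self) -> None:
        self._impls: Dict[str, Callable] = {}

    def register(self, name: str, impl: Callable) -> None:
        self._impls[name] = impl

    def call(self, name: str, *args):
        return self._impls[name](*args)

registry = ComponentRegistry()

# Software fallback, executed on a general-purpose core.
registry.register("dot_product", lambda a, b: sum(x * y for x, y in zip(a, b)))

def accelerated_dot(a: List[float], b: List[float]) -> float:
    # On a real chip, this stub would issue a call to a hardware
    # accelerator; here it is merely emulated in software.
    return sum(x * y for x, y in zip(a, b))

# If the chip provides a dot-product accelerator, re-bind the component.
registry.register("dot_product", accelerated_dot)

result = registry.call("dot_product", [1.0, 2.0], [3.0, 4.0])  # 11.0
```

The design point is that decomposing the program into named components is the only effort required of the programmer; the mapping decision is deferred to the registry.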

Main research steps:

  • Focus on heterogeneous systems, composed of a mix of cores and accelerators to tackle power issues.

  • Define a programming approach based on independent components that can be transparently mapped to software executing on cores or to calls to hardware accelerators.

Defect-Tolerant accelerators

Power issues motivated the switch from cores to accelerators (and heterogeneous multi-cores). Beyond power, we have explained that defects will likely become a dominant issue. Since accelerators will do most of the performance heavy lifting in the future, the challenge is now to design defect-tolerant accelerators.

A custom circuit, just like a core, is very vulnerable: a single faulty transistor can break it. Configurable logic offers more defect tolerance by building functions out of many identical logic elements; if one element becomes faulty, the functions can be remapped to the remaining valid elements. However, this approach assumes it is possible to test and identify each individual logic element and disable it if it is found faulty. While this approach is valid and should be explored, as the number of defects increases, the overhead of safely identifying and disabling faulty elements may significantly hurt scalability. As a result, we want to focus on approaches that keep operating correctly even in the presence of defects, without having to identify and disable the faulty elements (be they transistors or more complex logic elements).

These constraints have led us to artificial neural networks, where the distribution of information and learning capabilities provide inherent robustness. After the hype of the 1990s, when companies like Intel or Philips built commercial hardware systems based on neural networks, the approach quickly lost ground for multiple reasons: hardware neural networks were no match for software neural networks run on rapidly progressing general-purpose processors, their application scope was considered too limited, and progress in machine-learning theory overshadowed neural networks.

However, in the past few years, a remarkable convergence of trends and innovations has cast a new light on neural networks and makes them very attractive candidate accelerators for future computing systems. With respect to scope, Intel outlined in 2006 [16] that the community was not focusing on the key emerging high-performance applications. It defined these key applications as Recognition, Mining and Synthesis, and coined the term RMS. Example applications are face recognition for security, data mining for financial analysis, image synthesis for gaming, etc. Many of these applications, especially in Recognition and Mining, rely on algorithms for which competitive neural network-based versions exist. Even in the machine-learning community, the advent of Deep Networks [15] in 2006 has strongly revived interest in neural networks.

As a result, accelerators based on artificial neural networks have two key properties: they are inherently defect-tolerant, and they are versatile accelerators, i.e., they can be used to tackle multiple core algorithms of several key RMS applications. That makes them ideal candidate accelerators, provided they can be architected to effectively sustain transistor defects.
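
The inherent defect tolerance claimed above can be illustrated with a deliberately tiny example (the network, inputs and weights are made up for the sketch): because the computation is distributed over many synaptic weights, destroying a single weight, as a permanent defect would, shifts a neuron's output only slightly.

```python
import math

def neuron(inputs, weights):
    """Single sigmoid neuron: weighted sum followed by a squashing function."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-s))

inputs = [0.5] * 16          # 16 identical inputs (illustrative values)
weights = [0.25] * 16        # 16 identical synaptic weights

healthy = neuron(inputs, weights)

faulty_weights = list(weights)
faulty_weights[3] = 0.0      # one synapse destroyed by a hard defect
faulty = neuron(inputs, faulty_weights)

# The output moves by only a few percent despite the permanent fault.
degradation = abs(healthy - faulty)
```

Contrast this with a custom arithmetic circuit, where a single faulty transistor on a critical path can make every result wrong.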

Main research steps:

  • Explore more conventional digital accelerators based on configurable logic where faulty elements can be identified and disabled.

  • Define digital implementations of artificial neural networks (ANNs) which are robust to transistor defects.

Analog is inherently more defect-tolerant than digital

While digital ANNs can be made robust to defects, they remain an inefficient implementation: a fault on a low-order bit has little impact, but a fault on a high-order bit has a strong impact. In analog implementations, the magnitude of the value variation is correlated with the magnitude of the fault, so the effect of faults on the behavior of the circuit is more gradual. Analog has another asset: in embedded systems, the input is often originally analog (radio waves, sound, images, ...), and an analog circuit can process it natively, without digital conversion, saving circuit real estate and power, and further improving robustness.
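
The bit-position argument is easy to make concrete. In this small sketch (the 8-bit fixed-point encoding is an assumption for illustration), the same single-bit fault produces an error of 1 or 128 depending solely on where it lands, whereas in an analog encoding the error magnitude would track the physical magnitude of the fault.

```python
def flip_bit(value: int, bit: int) -> int:
    """Flip one bit of an 8-bit unsigned value, modeling a single-bit fault."""
    return value ^ (1 << bit)

x = 0b01100100  # the value 100 in an assumed 8-bit fixed-point encoding

low_fault = abs(flip_bit(x, 0) - x)   # fault on the low-order bit
high_fault = abs(flip_bit(x, 7) - x)  # fault on the high-order bit
```

The error spans two orders of magnitude (1 versus 128) for physically identical faults, which is exactly the non-gradual behavior the analog approach avoids.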

As a result, beyond digital accelerators, we want to investigate analog accelerators, especially analog neural network implementations. Neurons particularly shine for signal processing: because they are non-linear operators, they can easily implement complex analog functions such as integrators, which are commonplace in signal processing. As a result, complex signal processing tasks could be efficiently and directly implemented using neurons as operators, and learning could kick in to compensate for errors if the function output deteriorates.

Main research steps:

  • Investigate analog accelerators, especially based on neurons used either as analog operators, or as part of a neural network for learning-based compensation of errors.

Beyond CMOS, but still silicon?

Both digital and analog implementations rely on CMOS transistors. At the core of our research is the notion that CMOS size reduction may stall at some point. We should thus start investigating what could be done beyond CMOS. Interestingly, some of the key contender alternatives to the CMOS transistor, still based on silicon, are highly compatible with the approach developed so far.

For our purpose, a prime silicon-based contender is the memristor [20]. Theorized by Chua in 1971 [13], the memristor was first effectively manufactured as a silicon device by Williams' team in 2008 [20]. This component implements a resistive memory: its resistance can be changed, and that resistance is memorized. It is an almost ideal candidate for the implementation of either configurable logic (crossbars) or artificial synapses. In fact, the first memristor patents by Williams concern using the component to implement hardware artificial neural networks [19]. Synapses, which correspond to connections among neurons, memorize a weight applied to an input, i.e., almost exactly the operating behavior of a memristor.
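
Why the memristor fits the synapse role so naturally can be sketched numerically (all conductance and voltage values below are assumptions for illustration): in a crossbar, each memristor stores a conductance G (the synaptic weight), and applying input voltages V along the rows yields on each column a summed current I = Σ G·V, by Ohm's and Kirchhoff's laws, which is precisely the weighted sum a neuron computes.

```python
def crossbar_column_currents(conductances, voltages):
    """conductances: rows x cols matrix of memristor conductances (siemens);
    voltages: one input voltage per row (volts).
    Returns the summed current on each column (amperes)."""
    n_cols = len(conductances[0])
    return [sum(conductances[r][c] * voltages[r]
                for r in range(len(voltages)))
            for c in range(n_cols)]

G = [[1e-3, 2e-3],
     [3e-3, 4e-3]]   # stored synaptic weights, as memristor conductances
V = [0.5, 1.0]       # input activations, applied as row voltages

# Each column current is a weighted sum of the inputs: the analog
# crossbar computes the synaptic part of a neural layer "for free".
I = crossbar_column_currents(G, V)
```

The weighted sums are obtained directly from the device physics, with no multiplier or adder circuitry, which is what makes the memristor crossbar attractive for synapses.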

Among the other non-CMOS silicon contenders, PCMOS (Probabilistic CMOS) [12] is another interesting candidate. While this technology has been shown to be suitable for implementing neural networks [11], it is also well suited to implementing probabilistic, also called randomized, algorithms.

The goal of PCMOS is to leverage, for computing purposes, the irregular behavior of transistors due to low voltage and process variation. A transistor is considered to provide the correct answer, but only with a certain probability. By recasting application algorithms as probabilistic algorithms, it is possible to design whole circuits that take advantage of this property, yielding very low-power and defect-tolerant architectures. While PCMOS is not currently our primary focus, we will most likely investigate it for building a range of accelerator tiles.
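
The PCMOS idea can be modeled with a toy simulation (the error probability and vote count are assumptions for illustration): each evaluation of a logic function returns the correct answer only with probability p < 1, and a probabilistic algorithm recovers a reliable result by majority-voting over repeated noisy evaluations.

```python
import random

def noisy_bit(correct: int, p: float, rng: random.Random) -> int:
    """Model a PCMOS gate: return the correct bit with probability p,
    its complement otherwise."""
    return correct if rng.random() < p else 1 - correct

def vote(correct: int, p: float, trials: int, rng: random.Random) -> int:
    """Majority vote over repeated noisy evaluations of the same gate."""
    ones = sum(noisy_bit(correct, p, rng) for _ in range(trials))
    return 1 if ones * 2 > trials else 0

rng = random.Random(42)
# A single evaluation is wrong 20% of the time, yet the majority of
# 101 cheap, low-power evaluations recovers the correct value.
result = vote(correct=1, p=0.8, trials=101, rng=rng)
```

This is the essence of the approach: trade per-device reliability for energy, then restore reliability at the algorithm level.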

Main research steps:

  • Investigate hybrid implementations composed of CMOS logic (for neurons) and memristors (for synapses).

  • Maintain a technology watch on other alternatives, such as PCMOS, and investigate related accelerators and their scope.

Beyond silicon?

Beyond silicon, other alternative technologies are being contemplated, ranging from carbon nanotubes and graphene transistors to molecular-scale transistors or quantum computing. It is actually quite possible that no single alternative technology will prevail and truly replace the transistor, but that several will simply co-exist. This is all the more likely since each has particular strengths: as mentioned before, PCMOS is well suited to implementing probabilistic algorithms, memristors to implementing crossbars and synapses, quantum computing has strengths for certain categories of NP problems, etc. Whether technology unifies around a single approach or breaks down into multiple parallel paths, it remains compatible with the notion of heterogeneous systems and accelerators, where each accelerator would not only target certain algorithms, but would also be designed in a certain technology. Consequently, this approach may largely shield us from the speculative nature of the upcoming technologies.

Among the possible technologies compatible with the approach developed so far in this document, i.e., the notion of neural network-based robust accelerators, biology emerges as a natural contender. While this may seem far-fetched at first sight, there already exist working implementations of transistors connected to individual biological neurons, forming information loops with observed transistor-to-neuron communications [14]. Infineon, an embedded systems company, has even developed a prototype chip, called the NeuroChip, for connecting a full layer of biological neurons with transistors.

Beyond biology as a technology, neurobiology may also provide a useful path for expanding the application scope of the contemplated accelerators. Already, detailed models, such as the HMAX model proposed by Poggio [18], show how to reconstruct sophisticated vision processing tasks using individual neurons, i.e., models eligible for a replicated hardware implementation solely using neurons. Whether implemented in silicon technology or hybrid silicon-biological technology, these models would allow us to significantly expand the nature of the tasks these accelerators perform, beyond what artificial neural networks can do. Moreover, as the understanding of complex neurobiological functions progresses, they could be leveraged for our accelerators.

Main research steps:

  • Contemplate hybrid silicon-biology implementation of accelerators.

  • Factor in progress in neurobiology to expand the application scope of accelerators beyond ANNs.

  • Maintain a research watch on the progress of neurobiology to further expand application scope.