System Design Frontier with Exclusive Coverage on IC Design and Software Engineering from Hometown Innovation Automation Inc- Journal Page


Exclusive Frontier Coverage on System Design              Vol. 3 No. 2 Feb 2006

GUEST EDITORIAL: The Inexorable Progression of Parallelism in SoC - STMicroelectronics




SoC Market and Economic Trends

 

Pierre Paulin, Director of SoC Platform Automation

STMicroelectronics

The current deep submicron technology era presents two opposing challenges: rising SoC platform development costs and shorter product market windows. Compounding the problem is the rate of change due to evolving specifications, and the appearance of multiple standards that need to be incorporated into a design.

The rising platform development costs stem from three main sources. The first is the continued rise in gate and memory count. Today's SoCs can have over 100 million transistors, enough in theory to place the logic of over one thousand 32-bit RISC processors on a single die. Leveraging these capabilities is a major challenge. The second is the increased complexity of dealing with deep submicron effects, which include electromigration, voltage drop, and on-chip variation. These effects are dampening design productivity. In addition, rising mask set costs, currently over one million dollars, compound the problem and present a nearly insurmountable financial barrier to market entry for smaller companies.

The third source is the rising cost of embedded software development in current-generation SoCs, driven by an accelerated rate of new feature introduction. This is partly due to the convergence of the computing, consumer and communications domains, which implies supporting a broader range of functions and standards for a wide set of geographic markets. While the growth of hardware complexity in SoCs has tracked Moore's law, with a resulting growth of 56% in transistor count per year, industry studies show that the complexity of embedded S/W is rising at a staggering 140% per year. This software now represents over 50% of development costs in most SoCs, and above 75% in emerging multi-processor SoC platforms.
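To make the gap between these two growth rates concrete, the compound growth can be worked out directly. The sketch below assumes that the 56% and 140% figures quoted above are compound annual rates; the function names are illustrative only.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a compound annual growth rate (0.56 means 56%)."""
    return math.log(2) / math.log(1 + annual_growth)


def growth_factor(annual_growth: float, years: float) -> float:
    """Total multiplication factor after `years` of compound growth."""
    return (1 + annual_growth) ** years


HW_GROWTH = 0.56  # transistor count growth per year, as quoted in the text
SW_GROWTH = 1.40  # embedded S/W complexity growth per year, as quoted in the text

if __name__ == "__main__":
    print(f"H/W doubling time: {doubling_time_years(HW_GROWTH) * 12:.1f} months")
    print(f"S/W doubling time: {doubling_time_years(SW_GROWTH) * 12:.1f} months")
    ratio = growth_factor(SW_GROWTH, 3) / growth_factor(HW_GROWTH, 3)
    print(f"S/W-to-H/W complexity ratio after 3 years: {ratio:.1f}x")
```

Under this reading, hardware complexity doubles roughly every 18 to 19 months while software complexity doubles in under 10 months, so the software share of the effort keeps growing even on a platform whose hardware tracks Moore's law.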

As a result, the significant investment needed to develop the platform, typically between $10M and $100M for today's 90nm platforms, makes it essential to maximize the time-in-market of a given platform. On the other hand, consumer-led product cycles imply increasingly shorter time-to-market for the applications supported by the platform.

All of these factors lead to the need for a flexible, domain-specific platform which can be reused across a wide range of application variants. In addition, time-to-market considerations mean that the platform must come with high-level application-to-platform mapping tools that increase developer productivity. Both of these requirements point in the direction of highly S/W programmable platform solutions. A wide range of general-purpose and domain-specific cores exists, and these cores come with powerful compilation, debug and analysis tools. This makes them a key component of the flexible SoC of the future.

Before examining the possible platform solutions to these two challenges, let's examine some of the emerging process-technology trends that will affect the possible platform architecture choices.

Deep Submicron SoC Technology Trends

There are four emerging technology trends that will have to be considered when developing the next generation SoC platforms with embedded processors.

The Clock Frequency Barrier

The first technology trend is that we are approaching a de facto limit on the speed of embedded processors, which, in the embedded core space, are currently reaching a clock frequency plateau of approximately 1 GHz. While it may be technically feasible to develop faster embedded processor platforms (general-purpose processors run at over 3.5 GHz), 1 GHz is increasingly becoming a practical economic limit. Full-custom design, specialized processes and high design resource investments are needed to go beyond the 1 GHz barrier, and these investments are not in line with the realities of the embedded processor and SoC market. This market requires a fast technology porting phase at a reasonable cost, using primarily a synthesizable core approach. In response, the leading suppliers of high-speed RISC-style cores are now offering platforms with two to four parallel cores running at 1 GHz or less. This trend will continue.

The Power Barrier

A second key trend is the increasing importance of low power. Even for applications that could make use of a single, high-speed processor, it is more power-effective to use a large number of simpler, slower processors, if the tasks can be distributed in parallel onto these cores.

Table 1 illustrates this point, using the published power figures for the ARM processor family in a 130 nm CMOS process. As the table shows, a 5X performance increase requires a 20X increase in dissipated power. This implies a 4X power efficiency advantage for small, simple processors like the ARM7 over a more complex processor like the ARM11. The former uses a simple three-stage pipeline, while the latter uses an eight-stage pipeline, separate load/store and arithmetic pipelines, branch prediction, and other performance-enhancing hardware. All of these contribute to lower power efficiency. Note that this scaling effect is similar for other RISC cores, such as the MIPS licensable cores. It is also important to note that this comparison applies only to scalar, general-purpose code. The use of SIMD and DSP units on the ARM11 can significantly improve the power/performance ratio in specific applications.

 

Table 1: Power scaling for commercial RISC processors

Processor        Frequency (MHz)   Performance (DMIPS)   Core power (mW)   Power/Performance (mW/GIPS)
ARM7 TDMI Core   133 (1X)          133 (1X)              8 (1X)            60 (1X)
ARM11 MP Core    550 (4X)          675 (5X)              165 (20X)         254 (4X)
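The ratios in Table 1 can be re-derived from the raw frequency, performance and power figures. The following sketch does so and also estimates how many ARM7-class cores would be needed to match the ARM11 figure. It assumes perfectly parallel work and ignores interconnect, memory and synchronization power, so it gives an optimistic bound rather than a measurement; the `Core` class and `cores_needed` helper are illustrative names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    name: str
    freq_mhz: float
    dmips: float
    power_mw: float

    def mw_per_gips(self) -> float:
        """Core power divided by performance, in mW per 1000 DMIPS."""
        return self.power_mw / (self.dmips / 1000.0)


# Figures copied from Table 1 (130 nm CMOS).
ARM7 = Core("ARM7 TDMI", 133, 133, 8)
ARM11 = Core("ARM11 MP", 550, 675, 165)


def cores_needed(small: Core, target_dmips: float) -> int:
    """Smallest number of `small` cores whose summed DMIPS reaches the target.

    Assumes the workload splits perfectly across cores (an idealization).
    """
    return math.ceil(target_dmips / small.dmips)


if __name__ == "__main__":
    for core in (ARM7, ARM11):
        print(f"{core.name}: {core.mw_per_gips():.0f} mW/GIPS")
    n = cores_needed(ARM7, ARM11.dmips)
    print(f"{n} x {ARM7.name}: {n * ARM7.dmips:.0f} DMIPS at {n * ARM7.power_mw:.0f} mW")
```

Computed this way, the ARM11 figure comes out near 244 mW/GIPS rather than the 254 printed in the table, which may reflect rounding or a different measurement basis; the roughly 4X efficiency ratio holds either way. Under the idealized assumptions above, six ARM7-class cores would deliver 798 DMIPS for 48 mW, against 675 DMIPS for 165 mW on the single ARM11.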

 

The Latency Barrier

A third key trend which accompanies technology scaling is the widening gap between the speed of computing elements and that of interconnect and storage, leading to two types of latency. At the lowest level, the growing gap between gate delays and interconnect delays implies faster computing nodes communicating across a chip with increased interconnect latencies. It is predicted that propagating a signal across a chip at the 45nm process node will require over 5 clock cycles.

The second type of latency is due to the continued widening of the gap between embedded processor speeds and memory speeds. This gap roughly doubles every two years, which implies higher relative latencies for memory reads and writes.

Finally, due to low-power requirements, or the 1 GHz performance limit, multi-processor solutions will become increasingly common. This implies a higher rate of inter-processor communication, adding a third form of latency, in this case, at the architectural level.

The combination of these three types of latency implies that processors will be increasingly idle while waiting for data transmission, data storage and retrieval, and inter-processor communication. An approach which can keep the processor busy on other tasks is therefore essential. This implies two things: the availability of runnable parallel threads at the application level, and a mechanism which supports fast context switches between logical threads. If the thread context switch requires as many cycles as the interconnect, memory or inter-processor communication, then the benefit of parallelism will be lost.
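The trade-off between latency, thread count and context-switch cost can be illustrated with a simplified first-order utilization model of the kind often used in multithreading analyses. This is a sketch rather than a model taken from this article: it assumes every thread runs for a fixed number of cycles before stalling, that every stall lasts the same number of cycles, and that a single thread simply waits without switching. All parameter names and numbers are hypothetical.

```python
def utilization(n_threads: int, run: float, latency: float, switch: float) -> float:
    """Fraction of processor cycles spent on useful work.

    run:     cycles a thread executes before it stalls
    latency: cycles the stalled thread waits (interconnect, memory, IPC)
    switch:  cycles consumed by one context switch
    """
    if n_threads < 1:
        raise ValueError("need at least one thread")
    if n_threads == 1:
        return run / (run + latency)  # no switching: the core just waits
    # Too few threads: the core still idles for part of each stall.
    linear = n_threads * run / (run + switch + latency)
    # Enough threads: latency is fully hidden, only the switch cost remains.
    saturated = run / (run + switch)
    return min(linear, saturated)


if __name__ == "__main__":
    RUN, LATENCY = 50, 100  # hypothetical cycle counts
    for switch in (0, 10, LATENCY):
        row = ", ".join(
            f"{n} threads: {utilization(n, RUN, LATENCY, switch):.2f}"
            for n in (1, 2, 3, 4)
        )
        print(f"switch cost {switch:3d} cycles -> {row}")
```

With a zero-cycle switch, three threads are enough to keep the processor fully busy in this example. When the switch costs as many cycles as the latency it is meant to hide, utilization falls back to the single-thread value regardless of the number of threads, which is the loss of benefit described above.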

The Reliability Barrier

Although not an immediate concern in the current mainstream 90nm process node, chip reliability will become an increasing concern for the next process generations. The use of a regular array of parallel, symmetric processors greatly simplifies the provision of system-level redundancy and fault-tolerance. It is also a means to achieve higher manufacturing yield by over-provisioning the platform.
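The yield benefit of over-provisioning can be sketched with a standard k-out-of-n calculation. The model assumes that defects strike cores independently and that any good core can replace any other, which is what a regular, symmetric array makes plausible. The per-core yield used here is a hypothetical value, not a figure from this article, and the model ignores the area cost of the spare cores and defects in shared logic.

```python
from math import comb


def platform_yield(n_cores: int, k_required: int, core_yield: float) -> float:
    """Probability that at least k_required of n_cores are defect-free.

    Assumes independent defects with the same per-core yield `core_yield`.
    """
    return sum(
        comb(n_cores, good) * core_yield**good * (1 - core_yield) ** (n_cores - good)
        for good in range(k_required, n_cores + 1)
    )


if __name__ == "__main__":
    REQUIRED = 16        # cores the application needs (hypothetical)
    CORE_YIELD = 0.98    # per-core yield (hypothetical)
    for spares in (0, 1, 2):
        y = platform_yield(REQUIRED + spares, REQUIRED, CORE_YIELD)
        print(f"{spares} spare core(s): platform yield {y:.1%}")
```

With these hypothetical numbers, adding two spare cores to a 16-core array raises the fraction of usable dies from roughly 72% to above 99%.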

From the above market and process technology trends, it is clear that multiprocessor-based platforms will play a key role. A key question is therefore how to exploit this type of platform effectively. We need to tackle this challenge from three main directions:

1) The development of high-level platform programming models

2) The design of parallel platforms which support the programming models and facilitate the development of the platform mapping tools.

3) The development of effective platform mapping technologies.

Platform Programming Models

An SoC platform programming model is an abstraction of a heterogeneous system, consisting of a range of loosely and tightly coupled processors, and various hardware accelerators. Based on our experience in developing programming models, platform mapping tools and parallel platforms over the last five years, we believe that the leading-edge parallel programming models developed for general-purpose computing can be reused and adapted to the embedded multiprocessor space. In our view, at least two programming models are needed:

1) A distributed object model (DOM), in the spirit of CORBA or DCOM. This uses objects to abstract communication and relies on an explicit message-passing communication scheme.

2) A symmetric multi-processor (SMP) model, in the spirit of POSIX threads. This programming model relies on symmetric processing resources which access a shared memory.

 

Both of these programming models have their advantages and drawbacks, and we have found that, for the consumer-style multimedia SoC platforms we have been working with, we need to use both in a tightly coupled, interoperable fashion.
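The difference between the two models can be sketched in a few lines. In the example below, Python threads stand in for POSIX threads, and a queue-based object stands in for a CORBA- or DCOM-style remote object. It is only an illustration of the two styles, not an implementation of either middleware, and all class and function names are hypothetical. Python's interpreter lock also means this code shows the structure of each model rather than any real speedup.

```python
import queue
import threading


# SMP style: symmetric workers update shared memory under a lock.
def smp_sum(data, n_workers=4):
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # shared state needs explicit synchronization
            total += partial

    threads = [
        threading.Thread(target=worker, args=(data[i::n_workers],))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


# DOM style: an object that owns its state and is reached only by messages.
class SummerObject:
    def __init__(self):
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        while True:
            chunk, reply_box = self._inbox.get()
            if chunk is None:  # shutdown message
                break
            reply_box.put(sum(chunk))

    def sum_async(self, chunk):
        """Proxy method: turns the call into a message and returns a reply box."""
        reply_box = queue.Queue(maxsize=1)
        self._inbox.put((chunk, reply_box))
        return reply_box

    def shutdown(self):
        self._inbox.put((None, None))
        self._thread.join()


def dom_sum(data, n_objects=4):
    objects = [SummerObject() for _ in range(n_objects)]
    replies = [obj.sum_async(data[i::n_objects]) for i, obj in enumerate(objects)]
    total = sum(box.get() for box in replies)
    for obj in objects:
        obj.shutdown()
    return total


if __name__ == "__main__":
    samples = list(range(1000))
    print(smp_sum(samples), dom_sum(samples))
```

In the SMP version the workers are interchangeable and the cost is paid in synchronization on shared data. In the DOM version no data is shared, so each object could in principle sit on a different kind of processor, at the cost of explicit message traffic.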

A key assumption here is that the application developer is responsible for identifying and explicitly expressing parallelism using one or the other programming model. In our experience, for domain-specific application code in imaging, video and audio, this is a reasonable assumption. Parallelism is tractable and well understood in most cases. Moreover, designers have been dealing with this type of parallelism in hardware-based platforms for many years. For an application like an MPEG4 video encoder consisting of ten thousand lines of sequential reference code, the parallelization represents less than one person-month of work.

Parallel Multi-Processor SoC Platforms

The platforms themselves will require the support of two levels of parallelism: multi-threaded and multi-core. Large-scale use of multiple processor cores will arise primarily from the need for low power. Hardware multithreading, which consists of duplicating architectural registers to offer a zero-overhead context switch, will be needed to address the latency-hiding requirements.

Hardware multi-threading has been available for many years in network processors. More recently, it has become available as an option of licensable embedded cores, for example the MIPS M24K family. In the general-purpose computing space, the recent Sun Niagara supports hardware multi-threading on multiple cores, for all of the reasons cited above.

Ideally, from a tools-only perspective, the emerging SoC platforms would embed a homogeneous set of parallel processors. This is not realistic for the foreseeable future, for the following reasons:

Domain-specific cores like DSPs offer 2X to 4X performance in their domain of application via instruction specialization and wider instruction words. Adding SIMD-style word-level parallelism can increase performance by another factor of 2X to 8X in certain cases.

Configurable ASIPs (application-specific instruction-set processors) can offer 10X to 100X performance improvements via application-specific instruction-sets and tightly coupled H/W co-processors.

Hardware co-processors can offer 100X or more performance advantages and/or significant power and area savings. They will remain essential for highly parallel, regular operations with high data rates, in particular for data processing operations which are fixed for an application domain (e.g. direct and inverse DCTs in video processing).

Legacy code and general-purpose operating system support will often dictate the host processor for the platform. The data representation used in this processor is not likely to be compatible with the parallel processor subsystems, or the hardware co-processors.

Some application tasks will not be parallelizable, therefore one single, fast general-purpose core will be necessary to support these.
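The last point, that non-parallelizable tasks call for one fast general-purpose core, can be quantified with Amdahl's law. The article does not name this law; it is used here as a standard way to reason about the serial fraction of a workload. The sketch extends the usual formula with a factor for a faster serial core, and the example fractions and core counts are hypothetical.

```python
def amdahl_speedup(
    parallel_fraction: float, n_cores: int, serial_core_speedup: float = 1.0
) -> float:
    """Overall speedup relative to one baseline core.

    parallel_fraction:   share of the work that can be spread over n_cores
    serial_core_speedup: how much faster than baseline the core running
                         the serial part is (1.0 means no faster)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1 or serial_core_speedup <= 0:
        raise ValueError("n_cores must be >= 1 and serial_core_speedup > 0")
    serial_time = (1.0 - parallel_fraction) / serial_core_speedup
    parallel_time = parallel_fraction / n_cores
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    # Hypothetical workload: 90% parallelizable, 16 simple cores.
    print(f"simple cores only:     {amdahl_speedup(0.9, 16):.1f}x")
    print(f"plus 4x faster serial: {amdahl_speedup(0.9, 16, 4.0):.1f}x")
```

In this example, 16 simple cores yield a speedup of only 6.4X because the serial 10% dominates. Running that serial part on a core four times faster nearly doubles the overall speedup, which is the argument for keeping one fast general-purpose core alongside the parallel subsystem.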

As a result, for the foreseeable future, we believe that a performance- and power-effective platform for the consumer-dominated convergence market will be a heterogeneous mix of the following component classes:

A medium to high-performance, general-purpose RISC core, typically running a standard general-purpose operating system. All the top-level control code will run here. Legacy code that is not performance critical will also run on this processor. Finally, customer-specific developments and controlled access to the domain-specific parallel subsystems will usually occur via this processor and general-purpose O/S pair.

Domain-specific subsystems composed of mostly homogeneous multiprocessors. These will consist of domain- and application-specific processors. In many cases, they will have tightly coupled hardware co-processors.

Loosely coupled hardware co-processors for fixed, domain-specific functions. Increasingly, these will be implemented in one-time configurable logic (e.g. structured ASICs), or in the longer term, using an embedded FPGA fabric.

Domain-specific, but increasingly flexible I/O blocks.

Platform Programming Model and Mapping Requirements

This leaves some significant, but not insurmountable challenges for the platform programming model and mapping tools. Before discussing the requirements, here are some key observations from our experience:

Coarse grain parallelism is limited in most embedded applications. Typically, less than a dozen parallel tasks can be identified. Traditional operating systems can effectively support this level of parallelism. But, without exploiting smaller grain parallelism, only a limited amount of data-intensive processing can be performed on a single general-purpose processor.

In a wide range of video, 3D, imaging, audio and networking applications, there is much exploitable medium to fine-grain parallelism, consisting of threads with a few hundred instructions. 

Despite the trend towards processors and programmability, most of the raw computation of future SoC devices (80% or more) will remain in hardware, increasingly implemented with configurable or reconfigurable fabrics. Processors will also offer a range of acceleration options, including vector and data-parallel units, custom instruction sets, and application-specific coprocessors.

From these observations, our views on the platform, programming model, and mapping requirements are as follows:

Efficient implementations of the SMP (symmetric multi-processing) and DOM (distributed object model) platform programming models are required, which offer abstraction yet also support fine grain parallelism. 

The highly constrained SoC environment requires that the platform programming model overhead be minimal, on the order of 1000 instructions.

Efficient support of fine grain parallelism requires a platform with: a) hardware multi-threading, b) hardware-assisted thread scheduling, c) hardware-assisted message-passing.

The platform programming model should allow the transparent mapping of functionality on to a range of possible targets, including heterogeneous processors, hardware, or (re)configurable hardware.

To facilitate transparency, it is necessary to hide the heterogeneity of the underlying platform from the application programmer. This implies the use of a CORBA-like homogeneous object model on top of the heterogeneous platform. This requires the concept of an interface definition language (IDL) describing object interfaces.

The mapping technology must support a range of mapping timescales, ranging from a) static assignment, b) occasional reconfiguration, to c) dynamic remapping of symmetric and heterogeneous resources, on a task-by-task basis. 

The mapping technology must support exploration, optimization, quality of service options, functional verification, and timing (real-time) verification.

Coarse grain parallelism can be handled by the GP O/S. The requirement here is to support the lightweight SMP and DOM programming model abstractions on top of the O/S, without letting the multi-layer GP software prevent efficient subsystem interaction.
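The requirement for a homogeneous object model on top of a heterogeneous platform can be sketched as follows. A Python abstract base class plays the role that an IDL interface would play in a CORBA-like system, and the application is written once against that interface. One implementation runs as plain software, while a proxy marshals the same call into a different data representation for a hardware block. Everything here is hypothetical, including the `Scaler` interface, the `FakeCoprocessor` stand-in and its big-endian 16-bit format; no real IDL compiler or device is involved.

```python
import struct
from abc import ABC, abstractmethod


class Scaler(ABC):
    """Interface definition: what clients may call, independent of the target."""

    @abstractmethod
    def scale(self, samples: list[int], gain: int) -> list[int]: ...


class RiscScaler(Scaler):
    """Mapping 1: software on the host processor, native data representation."""

    def scale(self, samples, gain):
        return [s * gain for s in samples]


class FakeCoprocessor:
    """Stand-in for a hardware block that only accepts big-endian 16-bit words.

    The first word of the payload is the gain, the rest are samples.
    """

    def run(self, payload: bytes) -> bytes:
        gain, *samples = struct.unpack(f">{len(payload) // 2}h", payload)
        return struct.pack(f">{len(samples)}h", *(s * gain for s in samples))


class CoprocessorScalerProxy(Scaler):
    """Mapping 2: same interface, arguments marshalled into the device format."""

    def __init__(self, device: FakeCoprocessor):
        self._device = device

    def scale(self, samples, gain):
        request = struct.pack(f">{len(samples) + 1}h", gain, *samples)
        reply = self._device.run(request)
        return list(struct.unpack(f">{len(reply) // 2}h", reply))


def application(scaler: Scaler) -> list[int]:
    """Application code is written once against the interface."""
    return scaler.scale([1, -2, 3], gain=10)


if __name__ == "__main__":
    print(application(RiscScaler()))
    print(application(CoprocessorScalerProxy(FakeCoprocessor())))
```

Choosing which object to pass to `application` corresponds to the static end of the mapping timescales listed above. Occasional reconfiguration or dynamic remapping would make the same choice at run time, without any change to the application code.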

The Inexorable Progression of Parallelism

In conclusion, the exploitation of various forms of parallelism is becoming necessary to address the market and technology trends associated with the emerging deep submicron SoC platforms. The three key priorities for the semiconductor and EDA industries are: 

1) the definition of high-level, parallel platform programming models,

2) the development of flexible, multi-processor platforms designed for the efficient support of the programming models, and, most importantly,

3) the development of effective application-to-platform mapping technologies.

 
