Frontier Journal
Exclusive Frontier Coverage on System Design
Vol. 3 No. 2 Feb 2006
GUEST EDITORIAL
The Inexorable Progression of Parallelism in SoC
Pierre Paulin, Director of SoC Platform Automation, STMicroelectronics

SoC Market and Economic Trends

The current deep submicron technology era presents two opposing challenges: rising
SoC platform development costs and shorter product market windows. Compounding
the problem is the rate of change due to evolving specifications, and the
appearance of multiple standards that need to be incorporated into a design.

The rising platform development costs stem from three main sources. The first is
the continued rise in gate and memory count. Today's SoCs can have over 100
million transistors, enough in theory to place the logic of over one
thousand 32-bit RISC processors on a single die. Leveraging these capabilities
is a major challenge.

The second is the increased complexity of dealing with deep submicron effects. These
include electromigration, voltage drop, and on-chip variation. These effects
are dampening design productivity. In addition, rising mask set costs, currently over one
million dollars, compound the problem and present a nearly insurmountable
financial barrier to market entry for smaller companies.

The third source is the rising embedded
software development costs in current generation SoCs, driven by an accelerated
rate of new feature introduction. This is partly due to the convergence
of computing, consumer and communications domains which implies supporting a
broader range of functionalities and standards for a wide set of geographic
markets. While the growth of hardware complexity in SoCs has tracked Moore's
law, with a resulting growth of 56% in transistor count per year, industry
studies show that the complexity of embedded S/W is rising at a staggering 140%
per year. This software now represents
over 50% of development costs in most SoCs, and above 75% in emerging
multi-processor SoC platforms.

As a result, the significant investment needed to develop the platform, typically between
$10M and $100M for today's 90nm platforms, makes it essential to maximize the time-in-market
for a given platform. On the other hand, consumer-led product cycles imply
increasingly shorter time-to-market
for the applications supported by the platform.

All
of these lead to the need for a domain-specific flexible platform which can be
reused across a wide range of application variants. In addition, time-to-market
considerations mean that the platform must come with high-level
application-to-platform mapping tools that increase developer productivity.
Both of these requirements point in the direction of highly S/W programmable
platform solutions. A wide range of general-purpose and domain-specific cores
exist and they come with powerful compilation, debug and analysis tools. This
makes them a key component of the flexible SoC of the future.

Before examining the possible platform solutions to these two challenges, let's
examine some of the emerging process-technology-driven trends that will affect
the possible platform architecture choices.

Deep Submicron SoC Technology Trends

There are four emerging technology trends that will have to be considered when
developing the next generation SoC platforms with embedded processors.

The Clock Frequency Barrier

The first technology trend is that we are approaching a de facto limit on the speed
of embedded processors, which, in the embedded core space, are currently
reaching a clock frequency plateau of approximately 1 GHz. While it may be
technically feasible to develop faster embedded processor platforms (general-purpose
processors run at over 3.5 GHz), 1 GHz is increasingly becoming a practical economic
limit. Full-custom design, specialized processes and high design
resource investments are needed to go beyond the 1 GHz barrier and these
investments are not in line with the embedded processor and SoC market
realities. This market requires a fast technology porting phase at a reasonable
cost, using primarily a synthesizable core approach. As a response to this, the
leading suppliers of high-speed RISC-style cores are now offering platforms
with two to four parallel cores running at 1 GHz or less. This trend will
continue.

The Power Barrier

A second key trend is the increasing importance of low power. Even for
applications that could make use of a single, high-speed processor, it is more
power-effective to use a large number of simpler, slower processors, if the
tasks can be distributed in parallel onto these cores.

The following table illustrates this point, using the published power figures for
the ARM processor family, for a 130 nm CMOS process. As we can see in Table 1,
a 5X performance increase requires a 20X increase in dissipated power. This
implies a 4X power efficiency ratio for small, simple processors like ARM7s
over a more complex processor like an ARM11. The former uses a simple three-stage
pipeline, while the latter uses an eight-stage pipeline, separate
load/store and arithmetic pipelines, branch prediction, and other performance
enhancing hardware. All of these contribute to lower power efficiency. Note
that this scaling effect is similar for other RISC cores, such as the MIPS licensable
cores. It is also important to note that this comparison applies only to
scalar, general-purpose code. The use of SIMD and DSP units on the ARM11 can
significantly improve the power/performance ratio in specific applications.

Table 1: Power scaling for commercial RISC processors

Processor        Frequency (MHz)   Performance (DMIPS)   Core power (mW)   Power/Performance (mW/GIPS)
ARM7 TDMI Core   133 (1X)          133 (1X)              8 (1X)            60 (1X)
ARM11 MP Core    550 (4X)          675 (5X)              165 (20X)         254 (4X)

The Latency Barrier

A third key trend which occurs with technology scaling is the increasing gap between
the speed of computing elements and that of interconnect and storage, leading
to two types of latency. At the lowest level, the continued rise of the gap
between gate delays and interconnect delays implies faster computing nodes
communicating across a chip with increased interconnect latencies. It is
predicted that propagating a signal across a chip at the 45nm process node will
require over 5 clock cycles.

The second type of latency is due to the continued rise of the gap between embedded
processor speeds and memory speeds. This gap roughly doubles every
two years. This implies higher relative latencies for memory reads and writes.

Finally,
due to low-power requirements, or the 1 GHz performance limit, multi-processor
solutions will become increasingly common. This implies a higher rate of
inter-processor communication, adding a third form of latency, in this case, at
the architectural level.

The combination of these three types of latency implies that processors will be
increasingly idle while waiting for data transmission, data storage and
retrieval, and inter-processor communication. An approach which can keep the
processor busy on other tasks is therefore essential. This implies two things:
the availability of runnable parallel threads at the application level, and a
mechanism which supports fast context switches between logical threads. If the
thread context switch requires as many cycles as the interconnect, memory or
inter-processor communication latency it is meant to hide, then the benefit of parallelism will be lost.

The Reliability Barrier

Although not an immediate concern in the current mainstream 90nm process node, chip
reliability will become an increasing concern for the next process generations.
The use of a regular array of parallel, symmetric processors greatly simplifies
the provision of system-level redundancy and fault-tolerance. It is also a
means to achieve higher manufacturing yield by over-provisioning the platform.

From the above market and process technology trends, it is clear that multi-processor
based platforms will play a key role. A key question is therefore how to
effectively exploit this type of platform. We need to tackle these challenges
from three main directions:

1) The development of high-level platform programming models.
2) The design of parallel platforms which support the programming models and facilitate the development of the platform mapping tools.
3) The development of effective platform mapping technologies.

Platform Programming Models

An SoC platform programming model is an abstraction of a heterogeneous system,
consisting of a range of loosely and tightly coupled processors, and various
hardware accelerators. Based on our experience in developing programming
models, platform mapping tools and parallel platforms over the last five years,
we believe that the leading edge parallel programming models developed for
general-purpose computing can be reused and adapted to the embedded
multiprocessor space. In our view, at least two programming models are needed:

1) A distributed object model (DOM), in the spirit of CORBA or DCOM. This uses objects to abstract communication and relies on an explicit message-passing communication scheme.
2) A symmetric multi-processor (SMP) model, in the spirit of POSIX threads. This programming model relies on symmetric processing resources which access a shared memory.

Both of these programming models have their advantages and drawbacks, and we
have found that, for the consumer-style multimedia SoC platforms we have been
working with, we need to use both in a tightly coupled, interoperable fashion.

A key assumption here is that the application developer is responsible for
identifying and explicitly expressing parallelism using one or the other
programming model. In our experience, for domain-specific application code in
imaging, video and audio, this is a reasonable assumption. Parallelism is
tractable and well understood in most cases. Moreover, designers have been
dealing with this type of parallelism in hardware-based platforms for many
years. For an application like an MPEG4 video encoder consisting of ten
thousand lines of sequential reference code, the parallelization represents
less than one person-month of work.

Parallel Multi-Processor SoC Platforms

The platforms themselves will require the support of two levels of parallelism:
multi-threaded and multi-core. Large-scale use of multiple processor cores will
arise primarily from the need for low power. Hardware multithreading, which
consists of duplicating architectural registers to offer a zero-overhead
context switch, will be needed to address the latency-hiding requirements.

• Hardware multi-threading has been
available for many years in network processors. More recently, it has become
available as an option in licensable embedded cores, for example the MIPS M24K
family. In the general-purpose computing space, the recent Sun Niagara supports
hardware multi-threading on multiple cores, for all of the reasons cited above.
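
The latency-hiding argument can be made concrete with a small back-of-the-envelope model. The sketch below is not taken from the editorial: it assumes a simplified steady-state model in which every thread runs for `run` cycles, then stalls for `latency` cycles, and switching between threads costs `switch` cycles. The function name and its parameters are hypothetical, chosen only for illustration.

```python
def utilization(threads, run, latency, switch):
    """Fraction of processor cycles spent on useful work.

    Assumed model (illustrative only): each thread computes for `run`
    cycles, then waits `latency` cycles; a context switch costs
    `switch` cycles.
    """
    if threads < 1 or run <= 0 or latency < 0 or switch < 0:
        raise ValueError("threads >= 1, run > 0, latency >= 0, switch >= 0")
    # Too few threads: the processor still idles for part of each stall.
    linear = threads * run / (run + switch + latency)
    # Enough threads: the stall is fully hidden, only switch cost remains.
    saturated = run / (run + switch)
    return min(linear, saturated)


if __name__ == "__main__":
    # Zero-overhead (hardware) switch versus a switch as slow as the stall.
    for switch in (0, 100):
        for threads in (1, 2, 4, 6, 8):
            u = utilization(threads, run=20, latency=100, switch=switch)
            print(f"switch={switch:3d} threads={threads} utilization={u:.2f}")
```

With these assumed numbers (20 cycles of work per 100-cycle stall), six zero-overhead hardware threads keep the processor fully busy. When the switch cost equals the latency, utilization is capped at one sixth however many threads are runnable, which is no better than a single thread with a free switch.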
Ideally, from a tools-only perspective, the emerging SoC platforms would embed a
homogeneous set of parallel processors. This is not realistic for the
foreseeable future:

• Domain-specific cores like DSPs offer 2X to 4X performance in their domain of application via instruction specialization and wider instruction words. Adding SIMD-style word-level parallelism can increase performance by another factor of 2X to 8X in certain cases.
• Configurable ASIPs (application-specific instruction-set processors) can offer 10X to 100X performance improvements via application-specific instruction sets and tightly coupled H/W co-processors.
• Hardware co-processors can offer 100X or more performance advantages and/or significant power and area savings. They will remain essential for highly parallel, regular operations with high data rates, in particular for data processing operations which are fixed for an application domain (e.g. direct and inverse DCTs in video processing).
• Legacy code and general-purpose operating system support will often dictate the host processor for the platform. The data representation used in this processor is not likely to be compatible with the parallel processor subsystems, or the hardware co-processors.
• Some application tasks will not be parallelizable, so a single, fast general-purpose core will be necessary to support these.

As a
result, for the foreseeable future, we believe that a performance- and power-effective
solution for the consumer-dominated convergence platforms will be a
heterogeneous composition of the following component classes:

• A medium to high-performance, general-purpose RISC core, typically running a standard general-purpose operating system. All the top-level control code will run here. Legacy code that is not performance critical will also run on this processor. Finally, customer-specific developments and controlled access to the domain-specific parallel subsystems will usually occur via this processor and general-purpose O/S pair.
• Domain-specific subsystems composed of mostly homogeneous multi-processors. These will consist of domain- and application-specific processors. In many cases, they will have tightly coupled hardware co-processors.
• Loosely coupled hardware co-processors for fixed, domain-specific functions. Increasingly, these will be implemented in one-time configurable logic (e.g. structured ASICs), or in the longer term, using an embedded FPGA fabric.
• Domain-specific, but increasingly flexible I/O blocks.

Platform Programming Model and Mapping Requirements

This leaves some significant, but not insurmountable, challenges for the platform
programming model and mapping tools. Before discussing the requirements, here
are some key observations from our experience:

• Coarse-grain parallelism is limited in most embedded applications. Typically, fewer than a dozen parallel tasks can be identified. Traditional operating systems can effectively support this level of parallelism. But, without exploiting smaller-grain parallelism, only a limited amount of data-intensive processing can be performed on a single general-purpose processor.
• In a wide range of video, 3D, imaging, audio and networking applications, there is much exploitable medium- to fine-grain parallelism, consisting of threads with a few hundred instructions.
• Despite the trend towards processors and programmability, most (e.g. 80% or higher) of the raw computation of future SoC devices will remain in hardware (increasingly implemented with configurable or reconfigurable fabrics). Processors will also offer a range of acceleration, including vector and data-parallel options, custom instruction sets, and application-specific coprocessors.

From
these observations, our views on the platform, programming model, and mapping
requirements are as follows:

• Efficient implementations of the SMP (symmetric multi-processing) and DOM (distributed object model) platform programming models are required, which offer abstraction yet also support fine-grain parallelism.
• The highly constrained SoC environment requires that platform programming model overhead be minimal, on the order of 1000 instructions.
• Efficient support of fine-grain parallelism requires a platform with: a) hardware multi-threading, b) hardware-assisted thread scheduling, c) hardware-assisted message-passing.
• The platform programming model should allow the transparent mapping of functionality onto a range of possible targets, including heterogeneous processors, hardware, or (re)configurable hardware.
• To facilitate transparency, it is necessary to hide the heterogeneity of the underlying platform from the application programmer. This implies the use of a CORBA-like homogeneous object model on top of the heterogeneous platform. This requires the concept of an interface definition language (IDL) describing object interfaces.
• The mapping technology must support a range of mapping timescales, from a) static assignment, through b) occasional reconfiguration, to c) dynamic remapping of symmetric and heterogeneous resources, on a task-by-task basis.
• The mapping technology must support exploration, optimization, quality of service options, functional verification, and timing (real-time) verification.
• Coarse-grain parallelism can be handled by the general-purpose (GP) O/S. The requirement here is to support the lightweight SMP and DOM programming model abstractions on top of the O/S, without letting the multi-layer GP software prevent efficient subsystem interaction.

The Inexorable Progression of Parallelism

In conclusion, the exploitation of various forms of parallelism is becoming
necessary to address the market and technology trends associated with the
emerging deep submicron SoC platforms. The three key priorities for the
semiconductor and EDA industries are:
1) the definition of high-level, parallel platform programming models,
2) the development of flexible, multi-processor platforms designed for the efficient support of the programming models, and, most importantly,
3) the development of effective application-to-platform mapping technologies.
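
As a closing illustration, the two programming models discussed in this editorial can be contrasted in a few lines. The sketch below is only an analogy built on Python's standard `threading` and `queue` modules, not the platform tooling described here: `smp_sum` mimics the SMP style (symmetric workers updating shared memory under a lock), while the hypothetical `SummerObject` class mimics the DOM style (an object whose method call is carried out by explicit message passing). All names are assumptions made for the example.

```python
import queue
import threading


def smp_sum(values, workers=4):
    """SMP style: symmetric threads accumulate into one shared variable."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # shared memory needs explicit synchronization
            total += partial

    threads = [
        threading.Thread(target=work, args=(values[i::workers],))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


class SummerObject:
    """DOM style: callers see a method; messages cross the boundary."""

    def __init__(self):
        self._inbox = queue.Queue()
        self._outbox = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # The server side owns its data; nothing is shared with callers.
        while True:
            message = self._inbox.get()
            if message is None:  # shutdown request
                break
            self._outbox.put(sum(message))

    def add_all(self, values):
        self._inbox.put(list(values))  # request message
        return self._outbox.get()      # reply message

    def close(self):
        self._inbox.put(None)
        self._thread.join()
```

In the SMP version the programmer manages synchronization directly. In the DOM version the object interface hides where and how the work runs, which is the property that lets the same call be mapped onto a processor, a co-processor, or hardware.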