platform software,
compiler designs, and tooling for code development and debug. This article presents an overview of existing multicore DSP architectures as well as programming models, software tools, emerging applications,
challenges, and future trends of multicore DSPs.
HISTORICAL PERSPECTIVES: FROM SINGLE CORE TO MULTICORE
The concept of a DSP came about in the mid-1970s. Its roots were nurtured in the soil of a growing number of university research centers that were creating a body of theory on how to solve real-world problems using a digital computer. This research was academic in nature and was not considered practical, since it required state-of-the-art computers and could not be performed in real time.
A few years later, a toy called Speak & Spell was created using a single integrated circuit to synthesize speech. This device made two bold statements:
digital signal processing can be done in real time
DSPs can be cost effective
This began the era of the DSP. So, what made a DSP device different from other microprocessors? Simply put, it was the DSP's focus on performing complex math while guaranteeing real-time processing. Architectural details such as dual/multiple data buses, logic to prevent over/underflow, single-cycle complex instructions, a hardware multiplier, little or no capability to interrupt, and special instructions for signal processing constructs gave the DSP its ability to do the required complex math in real time.
“If I can’t do it with one DSP, why not use two of them?” That is the answer many customers gave after the introduction of DSPs with enough performance to change the designer’s mindset from “how do I squeeze my algorithm into this device?” to “guess what, when I divide the performance that I need to do this task by the performance of a DSP, the number is small.” The first encounter with this came a year or so after Texas Instruments (TI) introduced its first floating-point DSP, the TMS320C30, which offered significantly more performance than its fixed-point predecessors. TI set out to learn what customers were doing with this new DSP that they had not done with previous ones. The significant finding was that none of the customers were using only one device in their systems. They were using multiple DSPs working together to create their solutions.
As the performance of DSPs increased, more sophisticated applications could be handled in real time, progressing from voice to audio to image to video processing. Figure 1 depicts this evolution. The four lines in Figure 1 represent the performance increases of DSPs in terms of instruction cycles per sample period. For example, the sample rate for voice is 8 kHz, and initial DSPs allowed for about 625 instructions per sample period.
In the case of voice, the added performance made room for algorithms such as noise cancellation, echo cancellation, and voiceband modems. Figure 2 shows that this increase in performance came more from multiprocessing than from higher-performance single processing elements. Because digital signal processing algorithms are multiply-accumulate (MAC) intensive, Figure 2 also shows how adding multipliers to the architecture made performance follow an aggressive growth rate. Adding multiplier units is the simplest form of multiprocessing in a DSP device.
For TI, the obvious next step was to architect the next-generation DSPs with the communication ports necessary to link multiple DSPs together in a matrix within the same system. That device was created and introduced as the TMS320C40. And, as one might suspect, a follow-up (fixed-point) device, the TMS320C80, was created with multiple DSPs on one chip under the management of a reduced instruction set computer (RISC) processor.
The proliferation of computationally demanding applications drove the need to integrate multiple processing elements on the same piece of silicon. This led to a whole new world of architectural options: homogeneous multiprocessing, heterogeneous
[FIG1] Four examples of the increase of instruction cycles per sample period (vertical axis: instruction cycles per sample period; horizontal axis: year, 1982 to 2010). It appears that the DSP becomes useful when it can perform a minimum of 100 instructions per sample period. Note that for a video system, the pixel is used in place of a sample.