[FIG2] Four generations of DSPs show that multiprocessing has a greater effect on performance than clock rate. The dotted lines correspond to the performance increase due to clock-rate increases within an architecture. The solid line shows the increase due to both the clock-rate increase and parallel processing.
multiprocessing, processors versus accelerators, programmable versus fixed function, a mix of general-purpose processors and DSPs, or system in a package versus SoC integration. And then there is Amdahl's Law, which must be introduced to the mix [1], [2]. In addition, one needs to consider how the architecture differs for high-performance applications versus long-battery-life portable applications.
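As a concrete illustration of Amdahl's Law, the short C sketch below computes the theoretical speedup 1 / ((1 - p) + p/N) for a workload whose parallelizable fraction is p, run on N cores; the value p = 0.90 is purely illustrative.

#include <stdio.h>

/* Amdahl's Law: speedup on n cores when a fraction p of the
   workload (0 <= p <= 1) can be parallelized. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    /* Even with 90% of the work parallelizable, eight cores give
       well under an 8x speedup (about 4.7x). */
    for (int n = 1; n <= 8; n *= 2)
        printf("p = 0.90, cores = %d -> speedup %.2fx\n",
               n, amdahl_speedup(0.90, n));
    return 0;
}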
A homogeneous multicore DSP architecture consists of cores of the same type, meaning that all cores on the die are DSP processors. In contrast, heterogeneous architectures contain different types of cores; this can be a collection of DSPs together with general-purpose processors (GPPs), graphics processing units (GPUs), or microcontroller units (MCUs). Another way to classify multicore DSP processors is by the type of interconnect between the cores.
More details on the types of interconnect used in multicore DSPs, as well as the memory hierarchy of these multiple cores, are presented below, followed by an overview of the latest multicore chips. A brief discussion of performance analysis is also included.
In a hierarchical interconnect, CPUs are connected through switches that give them access to each other's memory. Switches are connected together to allow more distant CPUs to communicate, at the cost of longer latency. Communication is done by memory transfers between the memories associated with the CPUs. Memory can be shared between CPUs or be local to a CPU. The most prominent type of memory architecture makes use of a Level 1 (L1) local memory dedicated to each core, a Level 2 (L2) memory that can be dedicated or shared between the cores, and a Level 3 (L3) internal or external shared memory. If memory is local, data is moved from it to another local memory by a non-CPU block in charge of block memory transfers, usually called direct memory access (DMA). The memory map of such a system can become quite complex, and caches are often used to make the memory look "flat" to the programmer. L1, L2, and even L3 caches can be used to automatically move data around the memory hierarchy without explicit knowledge of this movement in the program. This simplifies the software written for such systems and makes it more portable, but comes at the price of uncertainty in the time a task needs to complete, because of uncertainty in the number of cache misses [5].
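The explicit data-movement style described above can be sketched as follows. Here dma_block_copy() and dma_wait() are hypothetical stand-ins for a vendor's DMA driver API, stubbed with memcpy so the sketch is self-contained; a real driver would program the DMA engine and return immediately, letting the CPU compute in parallel with the transfer.

#include <stdint.h>
#include <string.h>

#define BLOCK 1024

static int16_t l1_in[BLOCK];    /* core-local L1 working buffer */
static int16_t l2_frame[BLOCK]; /* frame buffer in shared L2/L3 */

/* Hypothetical DMA driver API (illustrative only). */
typedef int dma_handle_t;
static dma_handle_t dma_block_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);  /* stub: a real driver starts the engine */
    return 0;
}
static void dma_wait(dma_handle_t h) { (void)h; }

static void process_block(int16_t *buf, size_t n) { (void)buf; (void)n; }

void run_one_iteration(void)
{
    /* Start the block transfer, overlap it with other work, and
       block only when the data is actually needed. */
    dma_handle_t h = dma_block_copy(l1_in, l2_frame, sizeof l1_in);
    /* ... other computation can proceed here ... */
    dma_wait(h);
    process_block(l1_in, BLOCK);
}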
In a mesh network [6], [7], the DSP processors are organized in a two-dimensional (2-D) array of nodes. The nodes are connected through a network of buses and multiple simple switching units. The cores are locally connected with their "north," "south," "east," and "west" neighbors.
Memory is generally local, though a single node might have a cache hierarchy. This architecture allows multicore DSP processors to scale to large numbers of cores without increasing the complexity of the buses or switching units. However, the programmer generally has to write code that is aware of the local nature of each CPU; explicit message passing is often used to describe data movement, as in the sketch below.
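The following is a minimal sketch of this programming style; mesh_send() and mesh_recv() are hypothetical nearest-neighbor primitives, assumed for illustration rather than taken from any real mesh DSP's API.

#include <stdint.h>
#include <stddef.h>

enum dir { NORTH, SOUTH, EAST, WEST };

/* Hypothetical primitives: each core can only exchange data
   with its four mesh neighbors. */
void mesh_send(enum dir d, const void *buf, size_t nbytes);
void mesh_recv(enum dir d, void *buf, size_t nbytes);

#define N 256

/* One stage of a west-to-east pipeline: receive samples from the
   western neighbor, filter them locally, pass the result east. */
void pipeline_stage(void)
{
    int16_t in[N], out[N];

    mesh_recv(WEST, in, sizeof in);
    for (size_t i = 0; i < N; i++)  /* placeholder "filter": pass-through */
        out[i] = in[i];
    mesh_send(EAST, out, sizeof out);
}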
Multicore DSP platforms can also be categorized as symmetric multiprocessing (SMP) platforms and asymmetric multiprocessing (AMP) platforms. In an SMP platform, a given task can be assigned to any of the cores without affecting performance in terms of latency. In an AMP platform, the placement of a task can affect latency, giving an opportunity to optimize performance by optimizing the placement of tasks. This optimization comes at the expense of increased programming complexity, since the programmer has to deal with both space (task assignment to the multiple cores) and time (task scheduling). For example, the mesh network architecture of Figure 4 is AMP, since placing dependent tasks that need to communicate heavily in neighboring processors will significantly reduce latency. In contrast, in a hierarchically interconnected architecture, in which the cores communicate mostly by means of a shared L2/L3 memory and have to cache data from the shared memory, tasks can be assigned to any of the cores without significantly affecting latency. SMP platforms are easy to program but can result in much higher latency than AMP platforms.
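By analogy, the two styles can be contrasted on a Linux SMP host using the GNU pthread affinity extension: creating a thread and letting the scheduler place it anywhere is the SMP approach, while pinning communication-heavy tasks onto specific, neighboring cores is the AMP-style optimization. This is only an analogy for the DSP case, not vendor DSP code.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;              /* ... communication-heavy task ... */
    return NULL;
}

int main(void)
{
    pthread_t t;
    cpu_set_t set;

    /* SMP style: create the thread and let the OS place it anywhere. */
    pthread_create(&t, NULL, worker, NULL);

    /* AMP style: pin the thread to core 1, next to a partner task
       pinned on core 0, to shorten their communication path. */
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    if (pthread_setaffinity_np(t, sizeof set, &set) != 0)
        fprintf(stderr, "failed to set affinity\n");

    pthread_join(t, NULL);
    return 0;
}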