




    ARCHITECTURES OF MULTICORE DSPs



    [FIG2 plot data: DSP generations C1x/2x (one MAC/cycle), C62x (two MACs/cycle), C64x (four MACs/cycle), and C64x+ (eight MACs/cycle), plotted over the years 1982–2010.]
    In 2008, 68% of all shipped DSP processors were used in the wireless sector, especially in mobile handsets and base stations; so, naturally, development in wireless infrastructure and applications is the current driving force behind the evolution of DSP processors and their architectures [3]. The emergence of new applications such as mobile TV and high-speed Internet browsing on mobile devices greatly increased the demand for more processing power while lowering cost and power consumption. Therefore, multicore DSP architectures were established as a viable solution for high-performance applications in packet telephony, third-generation (3G) wireless infrastructure, and worldwide interoperability for microwave access (WiMAX) [4]. This shift to multicore brings significant improvements in performance, power consumption, and space requirements while lowering costs and clocking frequencies. Figure 3 illustrates a typical multicore DSP platform.
    Current state-of-the-art multicore DSP platforms can be defined by the type of cores available on the chip and include homogeneous and heterogeneous architectures. A homogeneous multicore DSP architecture consists of cores of the same type, meaning that all cores in the die are DSP processors. In contrast, heterogeneous architectures contain different types of cores: a collection of DSPs together with general-purpose processors (GPPs), graphics processing units (GPUs), or microcontroller units (MCUs). Another classification of multicore DSP processors is by the type of interconnect between the cores.

    [FIG2] Four generations of DSPs show how multiprocessing has more effect on performance than clock rate. The dotted lines correspond to the increase in performance due to clock increases within an architecture. The solid line shows the increase due to both the clock increase and the parallel processing.

    The design space includes multiprocessing, processors versus accelerators, programmable versus fixed function, a mix of general-purpose processors and DSPs, and system-in-a-package versus SoC integration. Then there is Amdahl's law, which must be introduced to the mix [1], [2]. In addition, one needs to consider how the architecture differs for high-performance applications versus long-battery-life portable applications.
    More details on the types of interconnect being used in multicore DSPs, as well as the memory hierarchy of these multiple cores, are presented below, followed by an overview of the latest multicore chips. A brief discussion on performance analysis is also included.




    [FIG3] Typical multicore DSP platform. (Diagram blocks per core: program unit; address unit with address ALUs and address registers; data unit with data ALUs and data registers; debugging via JTAG/EOnCE; power management.)



    [FIG4] Interconnect types of (a) hierarchical network and (b) mesh network multicore DSP architectures.

    In a hierarchical network, CPUs that require low latency and high bandwidth are placed close together on a shared switch and have low-latency access to each other's memory. Switches are connected together to allow more distant CPUs to communicate with longer latency. Communication is done by memory transfers between the memories associated with the CPUs. Memory can be shared between CPUs or be local to a CPU. The most prominent type of memory architecture uses Level 1 (L1) local memory dedicated to each core; Level 2 (L2) memory, which can be dedicated to or shared between the cores; and Level 3 (L3) internal or external shared memory. If memory is local, data is moved from it to another local memory by a non-CPU block in charge of block memory transfers, usually called direct memory access (DMA). The memory map of such a system can become quite complex, and caches are often used to make the memory look "flat" to the programmer. L1, L2, and even L3 caches can be used to automatically move data around the memory hierarchy without explicit knowledge of this movement in the program. This simplifies the software written for such systems and makes it more portable, but it comes at the price of uncertainty in the time a task needs to complete, because of uncertainty in the number of cache misses [5].
    In a mesh network [6], [7], the DSP processors are organized in a two-dimensional (2-D) array of nodes. The nodes are connected through a network of buses and multiple simple switching units. The cores are locally connected with their "north," "south," "east," and "west" neighbors. Memory is generally local, though a single node might have a cache hierarchy. This architecture allows multicore DSP processors to scale to large numbers without increasing the complexity of the buses or switching units. However, the programmer generally has to write code that is aware of the local nature of the CPU, and explicit message passing is often used to describe data movement.
    Multicore DSP platforms can also be categorized as symmetric multiprocessing (SMP) platforms and asymmetric multiprocessing (AMP) platforms. In an SMP platform, a given task can be assigned to any of the cores without affecting the performance in terms of latency. In an AMP platform, the placement of a task can affect the latency, giving an opportunity to optimize performance by optimizing the placement of tasks. This optimization comes at the expense of increased programming complexity, since the programmer has to deal with both space (task assignment to multiple cores) and time (task scheduling). For example, the mesh network architecture of Figure 4 is AMP, since placing dependent tasks that need to communicate heavily on neighboring processors will significantly reduce the latency. In contrast, in a hierarchically interconnected architecture, in which the cores mostly communicate by means of a shared L2/L3 memory and have to cache data from the shared memory, tasks can be assigned to any of the cores without significantly affecting the latency. SMP platforms are easier to program but can result in much higher latency than AMP platforms.


