Note: This page is out of date; I will update it soon.
DSP Compilers and Future Baseband Processing Architectures
(2008 – Present)
Digital signal processor architectures are typically idiosyncratic, as they often require very powerful instructions that are rarely used in general-purpose computing. They can also often execute many instructions in parallel in a given clock cycle. Because most modern high-performance DSP architectures are VLIW based, it is often the compiler’s responsibility to schedule this highly parallel computation on the target architecture. As such, a DSP architecture should be not only highly parallel, but also highly orthogonal and easily targetable by an optimizing compiler.
At the same time, the amount of computation needed for channel estimation, spreading/despreading and other functionality related to baseband processing can outstrip the computational resources of any single DSP core. This problem is further complicated as higher layers of the baseband processing protocol are incorporated into the basestation, as is the case with radio network controller (RNC) integration into the basestation for LTE. These higher protocol layers typically entail control code for packet processing in the control plane, rather than numerical processing in the data plane. The result is a system with heterogeneous programmable cores and possibly additional hardware-based acceleration in the form of ASICs, FPGAs, or soft programmable accelerators. The line quickly begins to blur between what was once thought of as “packet/buffer/queue processing”, “DSP code” and “hardware acceleration”. System design becomes increasingly complicated and multi-dimensional.
Partitioning becomes an issue not only between software implementation and hardware implementation, but also across heterogeneous multicore systems that run software on both DSPs and traditional processor architectures. The problem becomes more daunting as the amount of computation required for tasks such as channel estimation increases. Programming these types of machines, not only at the assembly level but even at the C-language level, becomes difficult when moving to heterogeneous multicore systems. Orthogonal architectures that can be targeted by optimizing compilers at higher and higher abstraction layers therefore become very attractive. Abstractions at the Simulink / Matlab level appeal to end developers, who may ultimately have a golden Matlab reference model from which C language or assembly and intrinsics based DSP implementations are derived. As such, a synergy between baseband architecture, compiler infrastructure, programming language model and overall system design is needed to meet the demands of the system in both the control plane and data plane.
Compiler Driven DSP Architecture Design Space Exploration
(2006 – 2009)
Modern embedded signal processing and multimedia workloads exhibit large amounts of both instruction and data level parallelism. Quite often, the amount of computation that can be performed in parallel far exceeds the available functional units, even in high-performance VLIW style DSPs. RISD, the retargetable compiler infrastructure for scalable DSP architectures, allows code to be rescheduled for user-definable DSP architectures, and permits rapid design space exploration in tandem with the computational demands of the input workload.
Using this compiler toolset, a user can investigate the amount of instruction and data level parallelism available in their application. Specifically, users can vary the target architecture with regard to the number of VLIW clusters, the register file size per cluster, the number and mix of functional units per cluster, and the cross-cluster communication interconnect and bandwidth. This new compiler infrastructure is used to investigate the parallelism available in various kernels common to both embedded wireless communications and video processing. Results show the computational and communication interconnect hardware necessary to fully exploit the available instruction and data level parallelism in these workloads. Additionally, gate usage efficiency is compared between traditional clustered VLIW style DSP architectures and custom FPGA based ASIPs.
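As a rough illustration of this kind of sweep, the sketch below varies the number of clusters and the number of functional units per cluster, and reports a simple lower bound on schedule length for one kernel. It is not RISD's algorithm: the kernel description, the operation counts, the names `KERNEL`, `schedule_lower_bound` and `sweep`, and the decision to ignore register pressure and cross-cluster communication cost are all assumptions made for the example.

```python
from itertools import product
from math import ceil

# Hypothetical kernel: operation counts per functional-unit class and the
# length of the longest dependence chain, in cycles. Not measured data.
KERNEL = {"ops": {"alu": 48, "mul": 32, "mem": 16}, "critical_path": 6}


def schedule_lower_bound(kernel, clusters, units_per_cluster):
    """Lower bound on cycles: the larger of the resource and dependence limits."""
    resource_bound = max(
        ceil(count / (clusters * units_per_cluster[kind]))
        for kind, count in kernel["ops"].items()
    )
    return max(resource_bound, kernel["critical_path"])


def sweep(kernel, cluster_options, unit_options):
    """Evaluate every (clusters, units per class) point in the design space."""
    results = []
    for clusters, n in product(cluster_options, unit_options):
        units = {kind: n for kind in kernel["ops"]}
        results.append((clusters, n, schedule_lower_bound(kernel, clusters, units)))
    return results


if __name__ == "__main__":
    for clusters, n, cycles in sweep(KERNEL, [1, 2, 4], [1, 2]):
        print(f"clusters={clusters} units/class={n} cycles>={cycles}")
```

With these made-up numbers the bound falls from 48 cycles on a single cluster with one unit per class to 6 cycles on four clusters with two units per class. Beyond that point the dependence chain, not the hardware, limits the schedule, which is the kind of saturation point a design space exploration looks for.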
Hardware/Software Co-Design Methodologies
(2005 – 2008)
Hardware and software co-design requires a formal methodology for partitioning workloads between programmable processor cores and custom hardware-like solutions. At the same time, the computational and memory system demands of the application must be considered with respect to the underlying hardware. As functionality is migrated from traditional programmable cores to multi-core systems with additional hardware-based coprocessors, both the hardware and software can exhibit varied behavior, and in some cases create new performance bottlenecks. This body of work provides an iterative hardware/software design methodology for partitioning real-time embedded multimedia and wireless applications between software programmable DSPs and hardware-based FPGA- and ASIC-like coprocessors.
By following a strict set of partitioning guidelines, input applications are partitioned between software executing on programmable DSP cores and hardware-based FPGA accelerators to alleviate system bottlenecks in modern VLIW style DSP architectures used in embedded systems. Hardware topologies are rapidly prototyped and simulated using the Spinach DSP/FPGA simulation infrastructure. This methodology is applied to modern wireless cellular communication workloads, as well as embedded multimedia codecs. Not only do these studies isolate compute and memory bottlenecks in the applications that were previously unseen by conventional simulation toolsets, but performance increases of as much as 11x are seen over conventional codecs compiled for uniprocessor DSP architectures.
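The trade-off behind such partitioning can be sketched with a simple cost model: a kernel is worth moving to hardware only if its accelerated run time plus the cost of moving its data is lower than its run time on the DSP. The code below is a minimal sketch under that assumption, not the partitioning guidelines themselves. The kernel names, cycle counts, speedup factors and transfer costs are hypothetical, and execution is assumed to be sequential with no overlap between the DSP and the accelerator.

```python
def greedy_partition(dsp_cycles, hw_speedup, transfer_cycles):
    """Pick the kernels whose offloaded cost is below their DSP cost."""
    chosen = set()
    for kernel, factor in hw_speedup.items():
        offloaded = dsp_cycles[kernel] / factor + transfer_cycles.get(kernel, 0)
        if offloaded < dsp_cycles[kernel]:
            chosen.add(kernel)
    return chosen


def partitioned_cycles(dsp_cycles, hw_speedup, transfer_cycles, offloaded):
    """Total cycles when the kernels in `offloaded` run on the accelerator."""
    total = 0.0
    for kernel, cycles in dsp_cycles.items():
        if kernel in offloaded:
            total += cycles / hw_speedup[kernel] + transfer_cycles.get(kernel, 0)
        else:
            total += cycles
    return total


# Hypothetical profile, in cycles per frame; not measured data.
DSP_CYCLES = {"fft": 600, "viterbi": 300, "control": 100}
HW_SPEEDUP = {"fft": 20, "viterbi": 10, "control": 2}
TRANSFER = {"fft": 20, "viterbi": 10, "control": 60}

if __name__ == "__main__":
    offloaded = greedy_partition(DSP_CYCLES, HW_SPEEDUP, TRANSFER)
    before = sum(DSP_CYCLES.values())
    after = partitioned_cycles(DSP_CYCLES, HW_SPEEDUP, TRANSFER, offloaded)
    print(sorted(offloaded), before, after, round(before / after, 2))
```

In this made-up profile the two numerical kernels move to hardware while the control code stays on the DSP, because its transfer cost outweighs the gain, and the total drops from 1000 to 190 cycles. A real methodology would iterate, since removing one bottleneck can expose another in the memory system or interconnect.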
Simulation of Heterogeneous DSP/FPGA Based Embedded Architectures
(2003 – 2007)
In recent years, there has been an explosion of growth in the area of embedded computing, ranging from military applications to consumer electronics and automotive control systems. With such a broad set of application domains to cover, the demands on performance, power consumption and robustness of embedded systems are ever increasing. Rather than simply increasing clock rates, these types of embedded devices increasingly contain heterogeneous computing elements used to partition the workload and exploit available parallelism. In addition to multiple processor cores, and even heterogeneous processor cores within a single embedded system, FPGA type reconfigurable computing elements are increasingly being employed where ASIC devices were once common.
In order to investigate hardware/software co-design for heterogeneous embedded devices, we present a flexible simulation infrastructure for DSP based heterogeneous system-on-a-chip architectures. The architectures modelled typically contain one or more Texas Instruments C64x series DSPs, MIPS micro-controllers, and reconfigurable FPGA fabrics for computational offloading and partitioning by the compiler. We also include modules to support memory system interconnect, on-chip and off-chip DRAM and SRAM, as well as DMA engines and other system components commonly found in embedded devices. Application level workloads are compiled using various off-the-shelf production compilers, and these workloads are then run on the proposed simulator topology in a bit-true, verified cycle-accurate manner.
Dynamically Reconfigurable Data Caches for Low Power Computing
(2001 – 2003)
In recent years there has been a marked increase in microprocessor power consumption. This has been due to increasingly complex hardware designs, increasing on-chip transistor counts, and increased clock rates. Many modern microprocessor designs also dedicate the majority of their transistors to on-chip instruction and data caches. Because of the large number of transistors which make up on-chip caches, they often account for a large portion of the total power consumed by modern microprocessors.
In order to curb these increasing levels of power consumption, we propose an L1 data cache which can be dynamically reconfigured at program runtime according to the memory traffic patterns of a given application. A two-phase approach is employed, involving both aggressive compile time analysis of memory traffic patterns and hardware runtime monitoring of program performance. The compiler models and predicts the L1 data cache requirements of loop nests in the input program, and instructs the underlying hardware on how much of the available L1 data cache to power enable during a loop nest’s execution. For regions of the program not analyzable at compile time, the hardware itself monitors program performance and reconfigures the L1 data cache so as to maintain cache performance while minimizing cache power consumption. In addition, in-depth studies of memory traffic patterns with respect to data cache performance were performed inside loop nests of the SPEC CPU2000 and Mediabench benchmarks. The sensitivity of data reuses to L1 data cache associativity is analyzed to illustrate the potential power savings a reconfigurable L1 data cache can achieve.
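To make the associativity sensitivity concrete, the sketch below simulates a small LRU set-associative cache on the address trace of a toy loop nest and counts misses as ways are powered down. It is an illustration only. The cache geometry, the assumption that reconfiguration happens at way granularity, the array placement, and the names `count_misses`, `loop_nest_trace` and `smallest_sufficient_ways` are all hypothetical rather than details of the proposed design.

```python
from collections import OrderedDict

LINE_BYTES = 32  # hypothetical line size
NUM_SETS = 8     # hypothetical number of sets
MAX_WAYS = 4     # hypothetical associativity with every way powered


def count_misses(addresses, enabled_ways):
    """Simulate an LRU set-associative cache with only `enabled_ways` powered."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        entries = sets[line % NUM_SETS]
        if line in entries:
            entries.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(entries) >= enabled_ways:
                entries.popitem(last=False)  # evict the least recently used line
            entries[line] = True
    return misses


def loop_nest_trace(reps=4, n=64, elem_bytes=4, base_a=0, base_b=1024):
    """Byte addresses touched by: for r in reps: for i in n: use a[i], b[i]."""
    trace = []
    for _ in range(reps):
        for i in range(n):
            trace.append(base_a + i * elem_bytes)
            trace.append(base_b + i * elem_bytes)
    return trace


def smallest_sufficient_ways(addresses):
    """Fewest enabled ways that miss no more often than the full cache."""
    baseline = count_misses(addresses, MAX_WAYS)
    for ways in range(1, MAX_WAYS + 1):
        if count_misses(addresses, ways) <= baseline:
            return ways
    return MAX_WAYS


if __name__ == "__main__":
    trace = loop_nest_trace()
    for ways in range(1, MAX_WAYS + 1):
        print(ways, count_misses(trace, ways))
    print("ways needed:", smallest_sufficient_ways(trace))
```

In this toy trace the two arrays map to the same sets, so a single enabled way thrashes on every access, while two ways already match the miss count of the full four-way cache. For a loop nest like this one, half of the ways could in principle be powered down without losing cache performance.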