Papers by Sudarshan Banerjee
19th International Conference on VLSI Design held jointly with 5th International Conference on Embedded Systems Design (VLSID'06), 2006
Performance of applications can be boosted by executing application-specific Instruction Set Extensions (ISEs) on specialized hardware coupled with a processor core. Many commercially available customizable processors have communication overheads in their interface with the specialized hardware. However, existing ISE generation approaches have not considered customizable processors that have communication overheads at their interface. Furthermore, they have not characterized the energy benefits of such ISEs. We present a soft-processor customization framework that takes an input 'C' application and realizes a customized processor capturing the microarchitectural details of its interface with the specialized unit. We are able to accurately measure the speedup, energy, power and code size benefits of our ISE approach on a real system implementation by applying the design flow to a popular Xilinx MicroBlaze soft-processor core synthesized for four real-life applications. We show that only one large ISE per application is sufficient to get an average 1.41× speedup over pure software execution in spite of incurring communication overheads in the ISE implementation. We also observe simultaneous savings in energy (up to 40%) and power (up to 12% peak power reduction) with this increased performance.
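The effect of interface communication overhead on ISE speedup can be illustrated with a simple timing model. This is only a sketch under stated assumptions, not the measurement flow used in the paper: the function name, its parameters, and the numbers in the usage lines are hypothetical.

```python
def ise_speedup(t_total, t_kernel_sw, t_kernel_hw, t_comm):
    """Overall speedup when one software kernel is moved into an ISE.

    All arguments are hypothetical cycle counts (or times in one unit):
      t_total     -- pure-software execution time of the whole application
      t_kernel_sw -- time the selected kernel takes in software
      t_kernel_hw -- time the same kernel takes on the specialized hardware
      t_comm      -- time spent moving operands/results across the interface
    """
    # New run time: remove the software kernel, add hardware time plus transfers.
    t_new = t_total - t_kernel_sw + t_kernel_hw + t_comm
    return t_total / t_new


# Hypothetical numbers, chosen only to show the trend.
ideal = ise_speedup(100.0, 50.0, 10.0, 0.0)      # no communication cost
with_comm = ise_speedup(100.0, 50.0, 10.0, 5.0)  # interface overhead included
print(round(ideal, 3), round(with_comm, 3))
```

In this model the communication term directly erodes the gain, which is why an ISE has to be large enough to amortize the transfers.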
Proceedings of the 4th international conference on Hardware/software codesign and system synthesis, 2006
Real-time multi-media applications are increasingly being mapped onto MPSoC (multi-processor system-on-chip) platforms containing hardware-software IPs (intellectual property) along with a library of common scheduling policies such as EDF and RM. The choice of a scheduling policy for each IP is a key decision that greatly affects the design's ability to meet real-time constraints, and also directly affects the energy consumed by the design. We present a cosynthesis framework for design space exploration that considers heterogeneous scheduling while mapping multimedia applications onto such MPSoCs. In our approach, we select a suitable scheduling policy for each IP such that system energy is minimized; our framework also includes energy reduction techniques utilizing dynamic power management. Experimental results on a realistic multi-mode multi-media terminal application demonstrate that our approach enables us to select design points with up to 60.5% reduced energy for a given area constraint, while meeting all real-time requirements. More importantly, our approach generates a tradeoff space between energy and cost, allowing designers to comparatively evaluate multiple system-level mappings.
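To see why the choice between EDF and RM matters for meeting real-time constraints, the classic utilization tests can be sketched as below. This is a textbook illustration, not the paper's cosynthesis framework. It assumes independent, preemptible periodic tasks on a single processing element with deadlines equal to periods; the task set and all function names are hypothetical.

```python
def utilization(tasks):
    """Total utilization of a task set given as (wcet, period) pairs."""
    return sum(wcet / period for wcet, period in tasks)


def edf_schedulable(tasks):
    """EDF: exact test under the stated assumptions (U <= 1)."""
    return utilization(tasks) <= 1.0


def rm_bound(n):
    """Liu-Layland utilization bound for n tasks under rate-monotonic."""
    return n * (2 ** (1.0 / n) - 1)


def rm_sufficient(tasks):
    """RM: sufficient (not necessary) test; False means 'inconclusive'."""
    return utilization(tasks) <= rm_bound(len(tasks))


# Hypothetical task set: (worst-case execution time, period).
tasks = [(1, 4), (2, 5), (3, 10)]
print(edf_schedulable(tasks), rm_sufficient(tasks))
```

For this task set the utilization is about 0.95, so EDF is guaranteed to meet all deadlines, while the RM bound for three tasks (about 0.78) gives no guarantee. A policy-aware exploration can exploit exactly this kind of difference.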
13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'05)
Partial dynamic reconfiguration, often called RTR (run-time reconfiguration), is a key feature in modern reconfigurable platforms. While partial RTR enables additional application performance, it imposes physical constraints necessitating simultaneous scheduling and placement while mapping application task graphs onto such architectures. In this report we present PARLGRAN, an approach that maximizes performance of application task chains by selecting a suitable granularity of data-parallelism for individual data parallel tasks. Our approach focuses on reconfiguration delay overhead and placement-related issues (such as fragmentation) while selecting individual data-parallelism granularity as an integral part of simultaneous scheduling and placement. As a key step to validating our proposed heuristic, we have additionally formulated (and implemented) an exact strategy (ILP). We demonstrate that our heuristic generates high-quality schedules by: (a) comparing our heuristic with the exact strategy on small testcases; (b) a very large set of synthetic experiments with over a thousand data-points where we compare our results with a simpler strategy that tries to statically maximize data-parallelism, i.e., does not consider the overheads and constraints associated with partial RTR; and (c) a detailed application case study of JPEG encoding. The detailed case study confirms that blindly maximizing data-parallelism can result in schedules even worse than those generated by a simple (but RTR-aware) approach oblivious to data-parallelism. Last, but very important, we demonstrate that our approach is well-suited for true on-demand computing: detailed execution time estimates of our heuristic indicate that the heuristic execution time is comparable to hardware task execution time, making it feasible to integrate our heuristic in a run-time scheduling approach.
Customization of processor architectures through Instruction Set Extensions (ISEs) is an effective way to meet the growing performance demands of embedded applications. A high-quality ISE generation approach needs to obtain results close to those achieved by experienced designers, particularly for complex applications that exhibit regularity: expert designers are able to manually exploit such regularity in the data flow graphs to generate high-quality ISEs. In this report, we present ISEGEN, an approach that identifies high-quality ISEs by iterative improvement following the basic principles of the well-known Kernighan-Lin (K-L) min-cut heuristic. Experimental results on a number of MediaBench, EEMBC and cryptographic applications show that our approach matches the quality of the optimal solution obtained by exhaustive search. We also show that our ISEGEN technique is on average faster than a genetic formulation that generates equivalent solutions. Furthermore, the ISEs identified ...
Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis - CODES+ISSS '04, 2004
Hardware/software (HW-SW) partitioning is a key problem in the codesign of embedded systems, studied extensively in the past. One major open challenge for traditional partitioning approaches, as we move to more complex and heterogeneous SoCs, is the lack of efficient exploration of the large space of possible HW-SW configurations, coupled with the inability to efficiently scale up with larger problem sizes. In this paper, we make two contributions for HW-SW partitioning of applications represented as procedural call graphs: 1) we prove that during partitioning, the execution time metric for moving a vertex needs to be updated only for the immediate neighbours of the vertex, rather than for all ancestors along paths to the root vertex; consequently, we observe faster run-times for move-based partitioning algorithms such as Simulated Annealing (SA), allowing call graphs with thousands of vertices to be processed in less than a second, and 2) we devise a new cost function for SA that allows frequent discovery of better partitioning solutions by searching spaces overlooked by traditional SA cost functions. We present experimental results on a very large design space, where several thousand configurations are explored in minutes as compared to several hours or days using a traditional SA formulation. Furthermore, our approach is frequently able to locate better design points with over 10% improvement in application execution time compared to the solutions generated by a Kernighan-Lin partitioning algorithm starting with an all-SW partitioning.
Proceedings of the 14th ACM Great Lakes symposium on VLSI, 2004
As the design community moves towards architecting multiprocessor systems-on-chip (MPSoC), it is widely believed that an on-chip interconnection network is potentially the best candidate to satisfy the high aggregate throughput needed by dozens of IP blocks. In this context, power (energy) estimation and reduction techniques for switches and links, the core components of an interconnection network, gain added significance. FIFO buffers are a key component of a majority of network switches; buffers have been estimated to be the single largest power consumer for a typical switch in an on-chip network. In this report, we analyze energy-power characteristics of FIFOs for on-chip networks and propose an optimization to reduce FIFO energy consumption in the context of an on-chip network. Our experimental results demonstrate promising reductions in energy consumption (19-33% for 256 and 512 bit wide links). Furthermore, our approach yields increasing energy reduction for wider links that are very likely to be used in future on-chip networks.
2006 IEEE International Conference on Field Programmable Technology, 2006
2007 25th International Conference on Computer Design, 2007
2007 44th ACM/IEEE Design Automation Conference, 2007
Proceedings. 42nd Design Automation Conference, 2005., 2005
Many reconfigurable architectures offer partial dynamic configurability, but current system-level tools cannot guarantee feasible implementations when exploiting this feature. We present a physically aware hardware-software (HW-SW) scheme for minimizing application execution time under HW resource constraints, where the HW is a reconfigurable architecture with partial dynamic reconfiguration capability. Such architectures impose strict placement constraints that lead to implementation infeasibility of even optimal scheduling formulations that ignore the nature of these constraints. We propose an exact and a heuristic formulation that simultaneously partition, schedule, and do linear placement of tasks on such architectures. With our exact formulation, we prove the critical nature of placement constraints. We demonstrate that our heuristic generates high-quality schedules by comparing the results with the exact formulation for small tests and with a popular, but placement-unaware, scheduling heuristic for larger tests. With a case study, we demonstrate extension of our approach to handle heterogeneous architectures with specialized resources distributed between general purpose programmable logic columns. The execution time of our heuristic is very reasonable: task graphs with hundreds of nodes are processed in a couple of minutes.
Asia and South Pacific Conference on Design Automation, 2006.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2009
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2006
Customization of processor architectures through instruction set extensions (ISEs) is an effective way to meet the growing performance demands of embedded applications. A high-quality ISE generation approach needs to obtain results close to those achieved by experienced designers, particularly for complex applications that exhibit regularity: expert designers are able to manually exploit such regularity in the data flow graphs to generate high-quality ISEs. In this paper, we present ISEGEN, an approach that identifies high-quality ISEs by iterative improvement following the basic principles of the well-known Kernighan-Lin min-cut heuristic. Experimental results on a number of MediaBench, EEMBC, and cryptographic applications show that our approach matches the quality of the optimal solution obtained by exhaustive search. We also show that our ISEGEN technique is on average 20× faster than a genetic formulation that generates equivalent solutions. Furthermore, the ISEs identified by our technique exhibit 35% more speedup than the genetic solution on a large cryptographic application by effectively exploiting its regular structure.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2006
ACM Transactions on Reconfigurable Technology and Systems, 2010
Partial dynamic reconfiguration (often referred to as partial RTR) enables true on-demand computing. In an on-demand computing environment, a dynamically invoked application is assigned resources such as data bandwidth and configurable logic. The limited logic resources are customized during application execution by exploiting partial RTR. In this article, we propose an approach that maximizes application performance when available bandwidth and logic resources are limited. Our proposed approach is based on theoretical principles of minimizing application schedule length under bandwidth and logic resource constraints. It includes detailed microarchitectural considerations on a commercially popular reconfigurable device, and it exploits partial RTR very effectively by utilizing the data-parallelism property of common image-processing applications. We present extensive application case studies on a cycle-accurate simulation platform that includes detailed resource considerations of the Xilin...
ACM Transactions on Embedded Computing Systems, 2008
Real-time multimedia applications are increasingly being mapped onto MPSoC (multiprocessor system-on-chip) platforms containing hardware-software IPs (intellectual property), along with a library of common scheduling policies such as EDF and RM. The choice of a scheduling policy for each IP is a key decision that greatly affects the design's ability to meet real-time constraints, and also directly affects the energy consumed by the design. We present a cosynthesis framework for design space exploration that considers heterogeneous scheduling while mapping multimedia applications onto such MPSoCs. In our approach, we select a suitable scheduling policy for each IP such that system energy is minimized; our framework also includes energy-reduction techniques utilizing dynamic power management. Experimental results on a realistic multimode multimedia terminal application demonstrate that our approach enables us to select design points with up to 60.5% reduced energy for a given area co...
Design, Automation and Test in Europe
Customization of processor architectures through Instruction Set Extensions (ISEs) is an effective way to meet the growing performance demands of embedded applications. A high-quality ISE generation approach needs to obtain results close to those achieved by experienced designers, particularly for complex applications that exhibit regularity: expert designers are able to manually exploit such regularity in the data flow graphs to generate high-quality ISEs. In this paper, we present ISEGEN, an approach that identifies high-quality ISEs by iterative improvement following the basic principles of the well-known Kernighan-Lin (K-L) min-cut heuristic. Experimental results on a number of MediaBench, EEMBC and cryptographic applications show that our approach matches the quality of the optimal solution obtained by exhaustive search. We also show that our ISEGEN technique is on average 20× faster than a genetic formulation that generates equivalent solutions. Furthermore, the ISEs identified by our technique exhibit 35% more speedup than the genetic solution on a large cryptographic application (AES) by effectively exploiting its regular structure.