Papers by Ahsan Javed Awan
Euromicro Digital System Design, 2021
Real-world applications are now processing big-data sets, often bottlenecked by the data movement between the compute units and the main memory. Near-memory computing (NMC), a modern data-centric computational paradigm, can alleviate these bottlenecks, thereby improving the performance of applications. The lack of NMC system availability makes simulators the primary evaluation tool for performance estimation. However, simulators are usually time-consuming, and methods that can reduce this overhead would accelerate the early-stage design process of NMC systems. This work proposes Near-Memory computing Profiling and Offloading (NMPO), a high-level framework capable of predicting NMC offloading suitability employing an ensemble machine learning model. NMPO predicts NMC suitability with an accuracy of 85.6% and, by using hardware-dependent application features, can reduce the prediction time by up to three orders of magnitude compared to prior works.
Ericsson Technology Review, 2020
With a vastly distributed system (the telco network) already in place,
the telecom industry has a significant advantage in the transition
toward distributed cloud computing. To deliver best-in-class application
performance, however, operators must also have the ability to fully
leverage heterogeneous compute and storage capabilities.
MECO, 2020
Modern radio telescopes like the Square Kilometer Array (SKA) will need to process exabytes of radio-astronomical signals in real time to construct a high-resolution map of the sky. Near-Memory Computing (NMC) could alleviate the performance bottlenecks due to frequent memory accesses in a state-of-the-art radio-astronomy imaging algorithm. In this paper, we show that a sub-module performing a two-dimensional fast Fourier transform (2D FFT) is memory bound using CPI breakdown analysis on IBM Power9. Then, we present an NMC approach on FPGA for 2D FFT that outperforms a CPU by up to a factor of 120x and performs comparably to a high-end GPU, while using less bandwidth and memory.
pre-print, 2019
Real-time clustering of big performance data generated by the telecommunication networks requires domain-specific high performance compute infrastructure to detect anomalies. In this paper, we evaluate noisy intermediate-scale quantum (NISQ) computers, characterized by low decoherence times, for K-means clustering and propose three strategies to generate the shorter-depth quantum circuits needed to overcome the limitations of NISQ computers. The strategies are based on exploiting (i) quantum interference, (ii) negative rotations and (iii) destructive interference. By comparing our implementations on the IBMQX2 machine for representative data sets, we show that NISQ computers can solve the K-means clustering problem with the same level of accuracy as that of classical computers.
pre-print, 2019
Support vector machine algorithms are considered essential for the implementation of automation in a radio access network. Specifically, they are critical in the prediction of the quality of user experience for video streaming based on device and network-level metrics. Quantum SVM is the quantum analogue of the classical SVM algorithm, which utilizes the properties of quantum computers to speed up the algorithm exponentially. In this work, we derive an optimized preprocessing unit for a quantum SVM that allows classifying any two-dimensional dataset that is linearly separable. We further provide a result readout method of the kernel matrix generation circuit to avoid quantum tomography, which in turn reduces the quantum circuit depth. We also derive a quantum SVM system based on an optimized HHL quantum circuit with reduced circuit depth. Index Terms: quantum support vector machine, noisy intermediate-scale quantum computers, HHL algorithm
Journal of Microprocessors and Microsystems, 2019
The conventional approach of moving data to the CPU for computation has become a significant performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement in 3D integration technologies has made the decade-old concept of coupling compute units close to the memory, called near-memory computing (NMC), more viable. Processing right at the "home" of data can significantly diminish the data movement problem of data-intensive applications. In this paper, we survey the prior art on NMC across various dimensions (architecture, applications, tools, etc.) and identify the key challenges and open issues with future research directions. We also provide a glimpse of our approach to near-memory computing that includes (i) NMC-specific microarchitecture-independent application characterization, (ii) a compiler framework to offload the NMC kernels on our target NMC platform and (iii) an analytical model to evaluate the potential of NMC.
Euromicro Conference on Digital System Design (DSD), 2019
Near-memory Computing (NMC) promises improved performance for the applications that can exploit the features of emerging memory technologies such as 3D-stacked memory. However, it is not trivial to find such applications and specialized tools are needed to identify them. In this paper, we present PISA-NMC, which extends a state-of-the-art hardware-agnostic profiling tool with metrics concerning memory and parallelism, which are relevant for NMC. The metrics include memory entropy, spatial locality, data-level, and basic-block-level parallelism. By profiling a set of representative applications and correlating the metrics with the application's performance on a simulated NMC system, we verify the importance of those metrics. Finally, we demonstrate which metrics are useful in identifying applications suitable for NMC architectures.
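As a rough illustration of one of the metrics named in this abstract, memory entropy can be read as the Shannon entropy of the distribution of accessed addresses: a high value means accesses are spread over many locations with little reuse. The sketch below is an assumption about the general idea, not the PISA-NMC implementation; the function name `memory_entropy` and the `drop_bits` parameter are hypothetical.

```python
import math
from collections import Counter

def memory_entropy(addresses, drop_bits=0):
    """Shannon entropy (in bits) of the accessed-address distribution.

    drop_bits discards low-order address bits so that accesses are grouped
    at a coarser granularity (hypothetical parameter, for illustration only).
    """
    counts = Counter(addr >> drop_bits for addr in addresses)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two synthetic traces: one touches 64 distinct addresses once each,
# the other keeps reusing the same two addresses.
streaming = list(range(0, 64 * 8, 8))
reuse = [0x1000, 0x1008] * 32
```

Under this reading, the streaming trace has higher entropy than the reuse trace, and raising `drop_bits` lowers the entropy of the streaming trace because neighbouring addresses fall into the same group.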
22nd ACM International Workshop on Software and Compilers for Embedded Systems (SCOPES '19), 2019
Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between CPU and memory. However, detecting such applications is not a trivial task. In this ongoing work, we extend the state-of-the-art platform-independent software analysis tool with NMC-related metrics such as memory entropy, spatial locality, data-level, and basic-block-level parallelism. These metrics help to identify the applications more suitable for NMC architectures. CCS Concepts: Software and its engineering → Dynamic analysis.
The conventional approach of moving stored data to the CPU for computation has become a major performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement in integration technologies has made the decade-old concept of coupling compute units close to the memory (called Near-Memory Computing) more viable. Processing right at the home of data can completely diminish the data movement problem of data-intensive applications. This paper focuses on analyzing and organizing the extensive body of literature on near-memory computing across various dimensions: starting from the memory level where this paradigm is applied, to the granularity of the application that could be executed on the near-memory units. We highlight the challenges as well as the critical need for evaluation methodologies that can be employed in designing these special architectures. Using a case study, we present our methodology and also identify topics for future research to unlock the full potential of near-memory computing.
Analyzing massive amounts of data and extracting value from it has become key across different disciplines. Clustering is a common technique to find patterns in the data. Existing clustering algorithms require parameters to be set a priori. The parameters are usually determined through trial and error in several iterations or through pre-clustering algorithms, which do not scale well for massive amounts of data. In this paper, we thus take one such pre-clustering algorithm, Canopy, and develop a parallel version based on MPI. As we show, doing so is not straightforward and, without optimization, a considerable amount of time is spent waiting for synchronisation, severely limiting scalability. We thus optimize our approach to spend as little time as possible with idle cores and synchronization barriers. As our experiments show, our approach scales near-linearly with increasing dataset size.
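For readers unfamiliar with Canopy, the following is a minimal sequential sketch of the pre-clustering step, assuming Euclidean distance, a loose threshold `t1` and a tight threshold `t2`, and a deterministic choice of the next center. It is only the textbook baseline, not the MPI-parallel version the paper develops; the function and parameter names are hypothetical.

```python
import math

def canopy(points, t1, t2):
    """Sequential Canopy pre-clustering with loose threshold t1 > tight threshold t2.

    Returns a list of (center, members) pairs. Canopies may overlap:
    a point within t1 of several centers belongs to each of them.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]  # deterministic pick; implementations often pick at random
        # Every remaining point within the loose threshold joins this canopy.
        members = [p for p in remaining if math.dist(center, p) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer become centers.
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return canopies

points = [(0, 0), (1, 0), (5, 0), (6, 0), (3, 0)]
result = canopy(points, t1=3.0, t2=1.5)
```

The number of canopies found this way is what typically seeds the parameter (for example, the cluster count) of the subsequent clustering algorithm.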
Neuromorphic hardware like SpiNNaker offers massive parallelism and efficient communication of small payloads to accelerate the simulation of spiking neurons in neural networks. In this paper, we demonstrate that this hardware is also beneficial for other applications that require massive parallelism and the large-scale exchange of small messages. More specifically, we study the scalability of PageRank on SpiNNaker and compare it to an implementation on traditional hardware. In our experiments, we show that PageRank on SpiNNaker scales better than on traditional multicore architectures.
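To see why PageRank fits a platform built for many small messages, consider a plain power-iteration sketch: in every round each vertex sends a small rank contribution along each outgoing edge. The code below is a generic single-threaded illustration with an assumed damping factor of 0.85 and a fixed iteration count; it is not the SpiNNaker implementation, and it assumes every link target also appears as a key of `out_links`.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping each node to its out-links."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node starts the round with the teleportation share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                # One small "message" per outgoing edge.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank

cycle = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
hub = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Each inner-loop update corresponds to one tiny per-edge message, which is the communication pattern the abstract refers to.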
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both batch and stream data processing. There is also a renewed interest in Near Data Processing (NDP) due to technological advancement in the last decade. However, it is not known if NDP architectures can improve the performance of big data processing frameworks such as Apache Spark. In this paper, we build the case for an NDP architecture comprising programmable-logic-based hybrid 2D integrated processing-in-memory and in-storage processing for Apache Spark, by extensive profiling of Apache Spark based workloads on an Ivy Bridge server.
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics. Recent studies propose scale-in clusters with in-storage processing devices to process big data analytics with Spark. However, the proposal is based solely on the memory bandwidth characterization of in-memory data analytics and also does not shed light on the specification of the host CPU and memory. Through empirical evaluation of in-memory data analytics with Apache Spark on an Ivy Bridge dual socket server, we have found that (i) simultaneous multi-threading is effective up to 6 cores, (ii) data locality on NUMA nodes can improve the performance by 10% on average, (iii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%, (iv) DDR3 operating at 1333 MT/s is sufficient and (v) multiple small executors can provide up to 36% speedup over a single large executor.
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to batch processing workloads. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual socket server. In our evaluation experiments, we have found that batch processing and stream processing have the same micro-architectural behavior in Spark if the only difference between the two implementations is micro-batching. If the input data rates are small, stream processing workloads are front-end bound. However, the front-end bound stalls are reduced at larger input data rates and instruction retirement is improved. Moreover, Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.
The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark based applications on a large scale-up server machine. Our analysis reveals that Spark based data analytics are DRAM bound and do not benefit from using more than 12 cores for an executor. By enlarging input data size, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). We match memory behavior with the garbage collector to improve the performance of applications by 1.6x to 3x.
In the last decade, data analytics has rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted to performance enhancement at the micro-architecture level. This paper characterizes the performance of in-memory data analytic applications using the Apache Spark framework. It uses a single-node NUMA machine and identifies the bottlenecks hampering the scalability of the workloads. In doing so, it quantifies the inefficiencies at the micro-architectural level for various big data applications. Through empirical evaluation, we show that Spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread level load imbalance. Further, at the micro-architecture level, we observe that memory bound latency is one of the major causes of work time inflation.
In this paper, the design space that optimizes the performance of an operational amplifier in terms of current consumption and unity-gain bandwidth product has been explored using Cadence. A two-stage indirect compensated active load cascode operational amplifier with a current consumption of 335 µA and a speed as high as 23 MHz has been presented. A novel indirect compensated multistage opamp designed in AMS 0.35 µm technology further reduces the current to 120 µA and increases the speed to 35 MHz. The layout of both designs incorporates the Common Centroid method for improved matching among devices.
A Brain Computer Interface (BCI) is a communication system that bypasses the brain's normal output pathways of muscles and peripheral nerves and allows a patient to control the external world by means of brain signals alone. For successful implementation of a BCI, dimensionality reduction and classification are fundamental tasks. In this paper, we use publicly available EEG signal data of upper limb motion. First, the dimensionality of the data is reduced using Principal Component Analysis (PCA), followed by classification of the reduced-dimension dataset by well-known classifiers, e.g. Artificial Neural Networks (ANN), Linear Discriminant Analysis (LDA) and Decision Trees (DT). To identify the classifier that performs the classification task most efficiently, we compare their performance on the basis of confusion matrices and percentage accuracies. The experimental results show that ANN is the best classifier for the classification of brain signals, with a percentage accuracy of 81.6%.
Use of Unmanned Aerial Vehicles (UAVs) has gained significant importance in recent years because of their ability to remotely monitor and perform various tasks in an autonomous manner. However, the control unit of such UAVs fails to adapt quickly when the UAVs are exposed to unpredictable and violent external disturbances such as violent wind gusts and extreme weather conditions. The cost of such adaptation failures can be extremely high and therefore, in order to use any crash-preventing strategy, it is imperative to design and use intelligent tools for the early detection of such failures. In this paper, we present a machine learning based autonomous tool, AWG-Detector, that detects Anomalies due to Wind Gusts (AWG) in our adaptive altitude control unit of an Aerosonde UAV. This adaptive altitude control unit comprises a PI-based roll controller and a hybrid neuro-fuzzy based pitch controller. Experimental results show that our AWG-Detector achieves an accuracy of more than 99% in detecting anomalies due to wind gusts. To the best of our knowledge, this is the first study that targets the detection of wind gust anomalies in the altitude control unit of an Aerosonde UAV by developing a comparison of five well-known machine learning techniques.
Thesis Chapters by Ahsan Javed Awan
The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark defines the state of the art in big data analytics platforms for (i) exploiting data-flow and in-memory computing and (ii) for exhibiting superior scale-out performance on the commodity machines, little effort has been devoted to understanding the performance of in-memory data analytics with Spark on modern scale-up servers. This thesis characterizes the performance of in-memory data analytics with Spark on scale-up servers. Through empirical evaluation of representative benchmark workloads on a dual socket server, we have found that in-memory data analytics with Spark exhibit poor multi-core scalability beyond 12 cores due to thread level load imbalance and work-time inflation (the additional CPU time spent by threads in a multi-threaded computation beyond the CPU time required to perform the same work in a sequential computation). We have also found that workloads
are bound by the latency of frequent data accesses to the memory. By
enlarging input data size, application performance degrades significantly due to the substantial increase in wait time during I/O operations and garbage collection, despite 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization).
For data accesses, we have found that simultaneous multi-threading is
effective in hiding the data latencies. We have also observed that (i) data
locality on NUMA nodes can improve the performance by 10% on average and (ii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%. For garbage collection impact, we match memory behavior with the garbage collector to improve the performance of applications by 1.6x to 3x and recommend using multiple small Spark executors, which can provide up to 36% reduction in execution time over a single large executor. Based on the characteristics of workloads, the thesis envisions near-memory and near-storage hardware acceleration to improve the single-node performance of scale-out frameworks like Apache Spark. Using modeling techniques, it
estimates a speed-up of 4x for Apache Spark on scale-up servers augmented with near-data accelerators.
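The definition of work-time inflation given in this abstract translates directly into a small calculation. The sketch below assumes per-thread CPU times and the sequential CPU time have already been measured; the function name and the sample numbers are hypothetical and are not results from the thesis.

```python
def work_time_inflation(thread_cpu_times, sequential_cpu_time):
    """Additional CPU time spent by all threads of a multi-threaded run
    beyond the CPU time needed to do the same work sequentially."""
    return sum(thread_cpu_times) - sequential_cpu_time

# Made-up measurements in seconds, for illustration only.
per_thread = [30.0, 32.0, 31.0, 35.0]
sequential = 100.0
inflation = work_time_inflation(per_thread, sequential)
relative_inflation = inflation / sequential
```

A positive value means the threads together burned more CPU time than the sequential run, for example because of memory-access latency or contention.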
Through empirical evaluation of representative benchmark workloads on a dual socket server, we have found that in-memory data analytics with Spark exhibit poor multi-core scalability beyond 12 cores due to thread level load imbalance and work-time inflation. We have also found that workloads are bound by the latency of frequent data accesses to DRAM. By enlarging input data size, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization).
For data accesses, we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by 10% on average and (ii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%. For GC impact, we match memory behaviour with the garbage collector to improve the performance of applications by 1.6x to 3x, and recommend using multiple small executors, which can provide up to 36% speedup over a single large executor.
desktop PC which is not an efficient solution in terms of area, power, cost and mobility. This thesis concerns the use of an FPGA as an implementation platform for NOILC. In this regard, a floating-point norm-optimal iterative learning controller for a gantry robot is developed and synthesized on a Virtex-5 XC5VLX110T FPGA chip. The comparison with a general-purpose processor based implementation shows that the proposed FPGA implementation reduces the execution time from 830 ms to 1.47 ms.