Group 3 - Project - Paper Review - Comp Arc
PROJECT
Trimester 2, Session 2022 / 2023 (2220)
The writers, Suvadip Batabyal and LoveKush Sharma, have carried out the research
properly. The paper is well organized and written in a manner that is easy for the reader to follow.
In addition, it covers the most essential topics: the introduction gives a short overview of the
quantum pipeline and highlights the differences between classical computers and quantum
computers; the related-works section mentions several prior efforts, including a proposed quantum
router built on IBM's 5-qubit quantum computer; the preliminaries state the definitions of the most
important concepts; and the authors then present their proposed plan and findings. They clarify the
sections with graphs, mathematical derivations, and conclusions.
The authors understood the topic well, but there are some major comments as well. They
claim that the proposed QISA is more scalable than the MIS QISA. However, this is not
necessarily true: the MIS QISA can be scaled to support more qubits by simply adding more
registers, whereas the proposed QISA is limited by the number of opcodes that can be decoded in
parallel. The authors further claim that the proposed QuPA can execute quantum programs with
much less quantum cost than a general serial architecture. Again, this is not necessarily true. The
gate cost of the proposed QuPA is O(2^m), where m is the number of opcodes that can be decoded
in parallel, while the gate cost of a general serial architecture is O(2^n), where n is the number of
qubits. If m is much smaller than n, the parallel decoder covers only a small portion of a program
at a time, so the proposed QuPA may not be able to execute quantum programs with much less
quantum cost than a general serial architecture. Moreover, the authors claim that the proposed
QISA and QuPA can be used to implement a wide range of quantum algorithms. However, the
proposed QISA and QuPA are designed for a specific type of quantum algorithm, namely those
that can be expressed as a sequence of quantum gates; other types of quantum algorithms, such as
those that involve measurement, may not be implementable using the proposed QISA and QuPA.
Furthermore, the article does not provide any concrete solutions to these challenges; more
research is needed to develop a quantum pipeline architecture that can execute quantum programs
efficiently. The authors could have improved the clarity of their writing by using more specific
examples and providing more details about their methods. They could have strengthened the
argument of the paper by reporting more experimental results and by comparing the performance
of their proposed architecture against other quantum pipeline architectures. Finally, they could
have made the paper more accessible to a wider audience by providing more background
information on quantum computing and by using less technical language.
Design and Implementation of High-Speed Transmission Link Based on
PCI-E
Meng, E., & Bu, X. (2020, June 23). Design and implementation of high-speed
transmission link based on PCI-e. IEEE Xplore. Retrieved March 27, 2022, from
https://ieeexplore.ieee.org/document/9123289
This paper addresses the establishment of a fast and reliable connection between two devices using the PCI-E interface.
PCI-E is a widely used high-speed serial computer expansion bus standard that provides high-
bandwidth communication between various components in a computer system. A practical design
of transmission link based on Rocket IO, PCI Express (PCI-E) and DDR3 is proposed. Large-
capacity dynamic First Input First Output (FIFO) has been designed with the aid of DDR3 to
prevent the data loss caused by PCI-E interrupts. Moreover, the design of PCI-E Direct Memory
Access (DMA) mode is also demonstrated.
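The paper itself describes an FPGA design built around Rocket IO, PCI-E, and DDR3, but the role of the large FIFO can be illustrated in software. The following C sketch of a ring-buffer FIFO stands in for the DDR3-backed buffer; the buffer depth and the producer/consumer loop are illustrative assumptions, not the authors' implementation.

```c
#include <stdio.h>

/* Minimal ring-buffer FIFO standing in for the large DDR3-backed
 * FIFO described in the paper. While the PCI-E side is stalled
 * (e.g. servicing an interrupt), incoming data accumulates here
 * instead of being dropped. The depth is an assumed toy value. */
#define FIFO_DEPTH 4096

typedef struct {
    unsigned char buf[FIFO_DEPTH];
    size_t head, tail, count;
} fifo_t;

static int fifo_push(fifo_t *f, unsigned char byte) {
    if (f->count == FIFO_DEPTH) return -1;   /* full: data would be lost */
    f->buf[f->head] = byte;
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count++;
    return 0;
}

static int fifo_pop(fifo_t *f, unsigned char *byte) {
    if (f->count == 0) return -1;            /* empty */
    *byte = f->buf[f->tail];
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count--;
    return 0;
}

int main(void) {
    fifo_t f = {0};
    /* Producer keeps writing while the "PCI-E side" is stalled... */
    for (int i = 0; i < 1000; i++) fifo_push(&f, (unsigned char)i);
    /* ...then the DMA engine drains the buffered data afterwards. */
    unsigned char b;
    size_t drained = 0;
    while (fifo_pop(&f, &b) == 0) drained++;
    printf("buffered and drained %zu bytes without loss\n", drained);
    return 0;
}
```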
After studying the research paper, I have a deeper understanding of the design and
implementation of a high-speed transmission link based on PCI-E. The research paper is full of
useful and helpful information, such as ways to increase the speed of data transmission and
achieve higher data transfer rates. Besides, I have noticed that in many multi-core processors,
memory management is one of the most critical responsibilities. The paper also goes over the
parametric representation of the speech signal, spectral analysis, and segmentation of the signal
into frames, ensuring that the signal frame size matches the cache block size for parallel
processing.
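To make the frame/block matching concrete, here is a minimal C sketch of the idea under assumed sizes (a 64-byte cache line and 4-byte float samples, not values taken from the paper): the desired frame length is rounded up so that each frame occupies whole cache lines.

```c
#include <stdio.h>

/* Illustrative sketch: pick a signal frame length that is a whole
 * multiple of the cache line size, so each frame occupies complete
 * cache blocks. 64-byte lines and float samples are assumptions. */
#define CACHE_LINE_BYTES 64
#define SAMPLE_BYTES     sizeof(float)
#define SAMPLES_PER_LINE (CACHE_LINE_BYTES / SAMPLE_BYTES)

int main(void) {
    size_t desired = 1000;  /* assumed desired frame length in samples */
    /* Round up to a multiple of the samples that fit in one line. */
    size_t frame = ((desired + SAMPLES_PER_LINE - 1) / SAMPLES_PER_LINE)
                   * SAMPLES_PER_LINE;
    printf("frame length: %zu samples (%zu cache lines)\n",
           frame, frame / SAMPLES_PER_LINE);
    return 0;
}
```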
The paper also explains the efficiency of the cache in a hierarchical memory system. The
most important factors to consider are the cache capacity, the block size, the method of mapping
main memory to cache memory, the algorithm for replacing information when the cache is full,
the algorithm for keeping the contents of main memory and cache memory consistent, and the
number of cache levels. Improving the parallel signal processing capability of a multi-core
processor has a great impact on some of them. As a result, these factors can be employed to
build an accelerated spectrum analysis algorithm.
The writer explains cache capacity and mentions that there is always a trade-off: the cache
should be small enough that its overall cost per bit is close to that of the RAM alone, yet large
enough that the access time to the cache memory determines the average access time in a system
with both main memory and cache memory. There are a number of techniques to avoid wasting
memory bandwidth; the strategy advocated here improves computing speed through a novel
method that matches the size of the signal frames to the block size of the cache memory. This type
of optimization can have a significant impact on overall parallel processing speed, and it can
already be utilized in digital signal processing to divide signals into frames on multi-core
computers. The cache block size is normally chosen on a modest scale, based on the width of the
data bus connecting the cache memory to the main memory and on the size of the cache memory
itself. The authors demonstrate how to make the best use of these parameters in parallel
computing: vector and matrix operations in digital signal processing should likewise have the
sizes of their streams matched to the size of the cache blocks.
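One practical way to realize this matching, sketched below under the same assumed 64-byte line size, is to also place each frame buffer on a cache-line boundary using POSIX aligned allocation; this is a generic illustration, not the authors' code.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

/* Sketch: allocate each signal frame on a cache-line boundary so a
 * frame never straddles an extra line. The 64-byte line size and
 * the frame length are illustrative assumptions. */
#define CACHE_LINE_BYTES 64
#define FRAME_SAMPLES    1024

int main(void) {
    float *frame = NULL;
    /* posix_memalign returns storage whose address is a multiple of
     * CACHE_LINE_BYTES, so the frame starts exactly on a line. */
    if (posix_memalign((void **)&frame, CACHE_LINE_BYTES,
                       FRAME_SAMPLES * sizeof(float)) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    printf("frame at %p (aligned to %d bytes)\n",
           (void *)frame, CACHE_LINE_BYTES);
    free(frame);
    return 0;
}
```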
The writer also compares performance results when using various types of parallel
technologies. Because faster memory is costly, the memory hierarchy is divided into numerous
levels, each smaller, faster, and more expensive than the level below it, which sits further away
from the CPU. The main idea is to create a memory system that costs almost as little as the
cheapest level while serving data at nearly the speed of the fastest level. Within a specific range
of input-signal frame sizes, the new parallel method can be more efficient than the older one. To
achieve the best performance from the new algorithm, which can speed up the operation rate,
each multi-core processor is approached individually.
Cache Access Reordering Tree for Efficient
Cache and Memory Accesses in GPUs
Gu, Y., & Chen, L. (2019, January 17). CART: Cache access reordering
tree for efficient cache and memory accesses in GPUs. IEEE Xplore.
Retrieved March 20, 2022, from
https://ieeexplore.ieee.org/document/8615696
GPUs (graphics processing units) are increasingly being employed to speed up general-
purpose computing. The thousands of threads working concurrently on a GPU demand a highly
efficient memory subsystem for data supply. The order of memory accesses has a considerable
impact on the memory subsystem. Despite the fact that reordering memory accesses at the L2
cache offers several potential benefits for both cache and DRAM, little work has been done to put
it to use. In this study, the authors reveal this previously overlooked opportunity of reordering L2
cache accesses. The paper proposes the Cache Access Reordering Tree (CART), a novel architecture that can
improve memory subsystem efficiency by actively reordering memory accesses at L2 cache to be
cache-friendly and DRAM-friendly. The proposed CART improves the average IPC of memory
intensive benchmarks by 34.2 percent with only 1.7 percent area overhead, according to evaluation
results using a diverse set of benchmarks.
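CART itself is a hardware tree of queues, but the benefit of reordering can be sketched in software: grouping pending requests by DRAM bank and row turns scattered accesses into row-buffer hits. The following C sketch is a simplified software analogue of that idea, not the CART design; the toy address-to-bank/row mapping is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>

/* Software analogue of access reordering: sort pending memory
 * requests so that accesses to the same DRAM bank and row become
 * adjacent (row-buffer hits). The 4-bank address split below is an
 * assumed toy mapping, not the paper's configuration. */
typedef struct { unsigned addr; } req_t;

#define BANK(a) (((a) >> 10) & 0x3)   /* bits 11:10 -> bank */
#define ROW(a)  ((a) >> 12)           /* higher bits -> row  */

static int cmp_bank_row(const void *pa, const void *pb) {
    unsigned a = ((const req_t *)pa)->addr;
    unsigned b = ((const req_t *)pb)->addr;
    if (BANK(a) != BANK(b)) return (int)BANK(a) - (int)BANK(b);
    if (ROW(a)  != ROW(b))  return ROW(a) < ROW(b) ? -1 : 1;
    return 0;
}

int main(void) {
    req_t pending[] = { {0x1400}, {0x0000}, {0x1000}, {0x0400}, {0x1010} };
    size_t n = sizeof pending / sizeof pending[0];
    qsort(pending, n, sizeof pending[0], cmp_bank_row);
    /* Same-bank, same-row requests now issue back to back. */
    for (size_t i = 0; i < n; i++)
        printf("addr 0x%04x -> bank %u row %u\n",
               pending[i].addr, BANK(pending[i].addr), ROW(pending[i].addr));
    return 0;
}
```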
To begin, I noticed how accurately the paper's sections were labeled. For example, the
first section was labeled "Introduction," and its main focus was to briefly introduce the reader to
the main problem, which in this case was how the GPU's SIMT execution generates a high
number of memory requests, putting high pressure on the memory subsystem and preventing the
GPU from reaching its peak performance. The writer has labeled the following sections just as
precisely. Another positive aspect of the work I noticed was how the writer used diagrams,
photos, and tables to properly convey the information he wanted to transmit to the reader. Aside
from that, I liked how, after addressing a problem, the writer immediately provides a solution to
it. For example, in section D (Drain Policy), the author stated that "no requests are selected for
Bank 2 and Bank 3, as they are not working during this time."
One thing I noticed when reading the paper was how densely it is written; I believe some
parts should have been written more simply so that the reader can comprehend and fully
understand the information the writer wishes to convey. Apart from that, I like how the writer
provided a very clear and detailed overview before proposing the Cache Access Reordering Tree
(CART). I found the research paper coherent and well justified, as the writer provided the readers
with statistics showing that a large amount of data was collected. Other than that, when I reached
the end of the third section (Impacts of Access Order on DRAM), the text switched from the left
column to the right column. However, the writer placed a diagram at the top of the right column,
which made me think that it was the end of the section; only after looking closely did I notice
that the paragraph actually continued in the right column. I believe it would have been better if
the writer had continued the paragraph normally and added the diagram afterwards. That way the
paragraph would not have been split in half, causing confusion for the reader.
While reading the paper I noticed many minor grammatical errors. For example, in the
first paragraph the writer wrote "which allows thousands of threads running simultaneously",
whereas it should instead be "which allows thousands of threads to run simultaneously".
Additionally, there is another grammatical error in Section VI (Results and Analysis), where the
writer wrote "While CART works better with more leaf queues, it may not necessary to provide
a large", which should be written as "While CART works better with more leaf queues, it may
not be necessary to provide a large". There are many more similar mistakes.
A Method of Mapping a Block of Main Memory to Cache in Parallel
Processing of the Speech Signal
Musaev, M., & Rakhimov, M. (2020, February 27). A method of mapping a block of
main memory to cache in parallel processing of the speech signal. IEEE Xplore. Retrieved
March 27, 2022, from
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9011946
This paper proposes a new method for achieving maximum speed up in parallel signal
processing by utilizing multi-core processor caches. The focus is on optimizing the cache memory
and creating fast algorithms using Intel's advanced tools. The proposed method involves parallel
processing of speech signals through spectral analysis, employing a new parallel algorithm. The
results demonstrate that aligning the signal frame size with the size of the L3 cache blocks leads
to significant acceleration in signal spectrum calculations.
The authors' innovative approach introduces a novel method for achieving efficient parallel
processing by leveraging multi-core processor caches. Emphasizing the importance of cache
optimization, the paper highlights the need to match the signal frame size with the cache block size
for optimal performance. The practical implementation of the proposed method using Intel's
advanced tools demonstrates its feasibility. The thorough performance analysis, comparing quad-
core and eight-core processors, adds empirical evidence to support the effectiveness of the
approach. Overall, the paper's relevance, practicality, and rigorous analysis make it a valuable
contribution to the field, addressing an important problem in parallel speech signal processing.
First, it makes notable contributions to the field of parallel signal processing with its
comprehensive and innovative approach. One of its key strengths lies in the authors' insightful
treatment of the challenge of cache optimization, a critical aspect of parallel computing. By
focusing on aligning the signal frame size with the cache block size, the authors demonstrate their
meticulous attention to detail in achieving efficient memory management. This emphasis on cache
optimization reflects their deep understanding of the underlying hardware architecture and its
impact on performance.
Moreover, the paper stands out due to its practical implementation using Intel's advanced
tools, namely Thread Building Blocks (TBB) and OpenMP. Leveraging these widely adopted tools
enhances the paper's credibility, as it showcases a realistic and industry-standard approach to
achieving efficient parallel processing. The authors' ability to translate their proposed method into
practical implementation strategies adds significant value to the research.
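To give a flavor of what such an implementation might look like, here is a generic OpenMP sketch (compiled with, e.g., gcc -fopenmp) of per-frame parallel signal processing; the frame size and the simple per-frame energy computation are illustrative assumptions, not the authors' code.

```c
#include <stdio.h>
#include <omp.h>

/* Generic OpenMP sketch of per-frame parallel signal processing:
 * the signal is split into cache-friendly frames and each frame is
 * processed independently. The frame size and the per-frame energy
 * computation are illustrative assumptions. */
#define N_SAMPLES 65536
#define FRAME     1024          /* assumed cache-friendly frame size */
#define N_FRAMES  (N_SAMPLES / FRAME)

int main(void) {
    static float signal[N_SAMPLES];
    static float energy[N_FRAMES];
    for (int i = 0; i < N_SAMPLES; i++)  /* synthetic input signal */
        signal[i] = (float)(i % 100) / 100.0f;

    /* One frame per loop iteration; OpenMP distributes frames across
     * cores, each thread touching its own block of memory. */
    #pragma omp parallel for
    for (int f = 0; f < N_FRAMES; f++) {
        float e = 0.0f;
        for (int i = 0; i < FRAME; i++) {
            float s = signal[f * FRAME + i];
            e += s * s;
        }
        energy[f] = e;
    }
    printf("frame 0 energy: %f (threads available: %d)\n",
           energy[0], omp_get_max_threads());
    return 0;
}
```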
In summary, the paper excels in several aspects, including its careful treatment of cache
optimization, practical implementation using established tools, and comprehensive performance
analysis. These strengths highlight the authors' expertise in parallel signal processing and
contribute to the advancement of the field. By addressing critical challenges and providing
empirical evidence of the proposed method's effectiveness, the paper serves as a valuable resource
for researchers and practitioners in the domain of parallel computing.
The introduction could benefit from providing a more explicit explanation of the specific
challenges or limitations in existing methods for parallel signal processing. This would help
readers understand the context and motivation for the proposed approach and highlight the unique
contribution of the research.
While the use of Intel's advanced tools, such as Thread Building Blocks (TBB) and
OpenMP, is mentioned, it would be valuable to provide additional details on how these tools were
applied in the practical implementation. Describing specific features or functionalities utilized
from these tools would enhance the understanding of their role in achieving efficient parallel
processing.