0% found this document useful (0 votes)
61 views33 pages

CS683 Project Presentation (Helloki)

The document describes Orinoco, an out-of-order processor that uses ordered issue and unordered commit with non-collapsible queues. It utilizes various matrix structures like age matrices, commit dependency matrices, and wakeup matrices to track dependencies between instructions and enable efficient scheduling. These matrices are implemented using processing-in-memory techniques for reduced complexity and timing constraints compared to traditional implementations.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
61 views33 pages

CS683 Project Presentation (Helloki)

The document describes Orinoco, an out-of-order processor that uses ordered issue and unordered commit with non-collapsible queues. It utilizes various matrix structures like age matrices, commit dependency matrices, and wakeup matrices to track dependencies between instructions and enable efficient scheduling. These matrices are implemented using processing-in-memory techniques for reduced complexity and timing constraints compared to traditional implementations.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 33

CS-683 Project Checkpoint 1 Presentation

Orinoco: Ordered Issue and Unordered


Commit with Non-Collapsible Queues
Team Helloki

Atharva Kulkarni Avaneesh Sai Puja Kudupudi


210070047@iitb.ac.in 210070015@iitb.ac.in 210070046@iitb.ac.in

1
Challenges for Modern O3 Processors
● Moore's law has slowed down but still functional
● Microarchitectures becoming wider and deeper
● Full potential is not harnessed due to inefficient use of
resources
● Survey of Google and Facebook services:
○ Only 20-40% instructions retired without stalling
○ Actual bandwidth is 1/3rd of theoretical bandwidth

Challenges for Modern O3 Processors 2


Solutions?
● Need aggressive instruction scheduling
● O3 Commit comes in
● Scheduling: NP hard problem
● Being on the critical path of pipeline, it should be done
with minimum latency without complex hardware

Solutions? 3
Qs in O3
● Issue Q (IQ)
● Re-Order Buffer (ROB)
● Load/ Store Q (LSQ)
● They determine temporal order of instructions

Qs in O3 4
Existing O3 Commit Techniques
● Existing methods lack in capacity efficiency of
scheduling structures while preserving ideal instruction
ordering

Existing O3 Commit Techniques 5


Existing O3 Commit Techniques
● Prioritize certain instructions at commit
● But they struggle with reclaiming the gaps left in Qs
● Main Challenge: Physical ordering of instructions
within Qs

Existing O3 Commit Techniques 6


Orinoco
● Ordered issue and unordered commit
● Non collapsible IQ, ROB, LSQ
● Decouples temporal order from positions in Qs

Ordered
issue Unordered
commit

Orinoco 7
Age Matrices
● Track relative age in ROB and IQ
● # rows = # cols = # entries in respective Qs
● After decoding, renaming and dispatching to the Q,
valid (VLD) vector tracks entries
● Operation: When dispatched, set row vector and clear
column vector

Age Matrices 8
Age Matrix for IQ
● After waking up* in IQ, and ready to schedule, set BID
vector bit *Details in further sections

Age Matrix for IQ 9


Age Matrix for IQ
count = zeros(#instructions)
for instruction_in_ready_instructions:
temp = bitwise_AND(row_vec, BID)
count[instruction] = #ones in temp
● If ∃ nth oldest instruction, ∃ at most (n-1) older
instructions
● count <= n-1
● Thus, to select IW (Instruction Width) oldest
instructions, select the ones with count = IW - 1

Age Matrix for IQ 10


Age Matrix for ROB
● When speculation resolves, clear corresponding bit in
SPEC vector
for every_completed_instruction:
if bitwise_AND(row_vec,SPEC)==all_zeros:
commit

Age Matrix for ROB 11


Age Matrix for ROB
● Need to take care of exceptions/ misspeculation
● More importantly, oldest exception/ misspeculation

Age Matrix for ROB 12


Age Matrix for ROB
result = zeros(#entries, #entries)
for every in_flight instruction:
result[instr] = bitwise_AND(row_vec, VLD)
one_hot = reduction_NOR_along_rows(temp)
● Only for the oldest instruction, result is all zeros
● After reduction NOR, we have one-hot vector pointing to
instruction which raised exception/ misspeculation

Age Matrix for ROB 13


Commit Dependency Matrix
● Local state of an instruction:
○ Completed
○ Raised exception
○ Flushed by some other event
● Global state of an instruction:
○ Completed
○ Misspeculation
○ Exception
● Commit depends on local state of the instruction and
global state of older instructions
Commit Dependency Matrix 14
Commit Dependency Matrix
● Structure to identify the global state earlier in the
pipeline
● # rows = # cols = # in-flight instructions
● Intersections indicate commit dependencies

Commit Dependency Matrix 15


Commit Dependency Matrix
● On every new dispatch to ROB i.e. new row entry, set
bits in every column corresponding to older instructions
which may raise exceptions
● E.g.: Memory operations, branches and synchronization
barriers
● When an older instruction completes, clear its column
vector

Commit Dependency Matrix 16


Commit Dependency Matrix
● Before commit, every instruction waits for the commit
dependency to resolve i.e. th older ones to become
non-speculative

// check global state of completed instruction


if reduction_NOR(row_vec)==1:
commit if resources available

Commit Dependency Matrix 17


Precise Exception Handling
● Delinquent instruction = Instruction causing exception
● Delinquent and all subsequent instructions (w.r.t.
program order) are squashed
● The subsequent ones are located using column vector
of the delinquent instruction
● When none are commiting, determine delinquent
(unresolved or exception causing) using age matrix

Precise Exception Handling 18


Matrix Merging
● Global state is same for all instructions in ROB
● Thus, merge commit dependency matrix and age matrix
for ROB (saves space of order O(n2))
● A SPEC vector now tracks speculative instructions
// at commit time
for a completed_instruction:
result = bitwise_AND(row_vec, SPEC)
if reduction_NOR(result)==1:
grant to commit
Matrix Merging 19
Memory Disambiguation matrix
● The order and dependencies between loads and stores
need to be explicitly tracked
● This is done to decide the absence of conflicts as early
as possible

Memory Disambiguation Matrix 20


Memory Disambiguation Matrix
// when load enters LQ
row_vec = set_bit_at_store_addr_in_SQ
if store_addr != load_addr:
clear_col
if reduction_NOR(row_vec)==1:
Load_entry_becomes_non-speculative
● # rows = # loads in LQ, # columns = # stores in SQ
● A load in LQ sets its row according to older unresolved stores in SQ at
issue, and a store clears bits in its column when its address does not
conflict
● A load becomes non-speculative when all the bits in its row vector is
zero
Memory Disambiguation Matrix 21
Lockdown Matrix
● We use a lockdown matrix to track older non-performed
loads for each speculative load

Lockdown Matrix 22
Lockdown Matrix
//at commit
row_vec = set_bit_at_all_unperf_loads
if load_complete:
clear_col
if reduction_NOR(row_vec) == 1:
load_is_ordered
● # rows = # committed loads, # columns = # loads in LQ
● A committed load sets its row according to older unperformed
loads in LQ, and a performed load clears its column
● A lockdown to the address of a committed load is lifted when all
theMatrix
Lockdown bits in its row vector is zero
23
Wakeup Matrix
● Producers - Source operands
● An instruction sets its row according to its producers in
IQ at dispatch, and clears its column at issue

Wakeup Matrix 24
Wakeup Matrix
// at dispatch
row_vec = set_bit_at_all_instr_with_producers
if instr_issue:
clear_col
if reduction_NOR(row_vec) == 1:
instr_wake_up
● # rows = # columns = # Entries in IQ
● An instruction sets its row according to its producers in IQ at
dispatch, and clears its column at issue
● An instruction is woken up when all the bits in its row vector is zero

Lockdown Matrix 25
Matrix Scheduler Challenges
● Increased logic complexity
○ Fulfills O(1) time scheduling

● Timing and area constraints

● 12T SRAM arrays : 8 for dependency storage

● Eg: Age matrix performs n2 AND and n n-bit OR


operations per cycle

Matrix Scheduler Challenges 26


Processing-In-Memory Implementation
● Implements efficient matrix schedulers

● O(matrix scheduling) = O(element-wise logic operations)

● 8T SRAM arrays

● PIM supports vertical read and write mechanisms

Processing-In-Memory 27
Bit Line Computing
● Matrix scheduler stores dependency of instructions

● Precharge RBL
beforehand
● Activate RWL
● Bit count is encoded by
voltage drop on bit lines

Bit Line Computing 28


Vertical Access
● Standard SRAM arrays only support horizontal access

● Update Matrix
Schedulers
○ Column-wise write
● Memory Disambiguation
Matrix
○ Column-wise read

Vertical Access 29
Multibanking
● Multibanking for parallel processing

● Horizontally partitioned SRAM arrays


○ Matrix Schedulers into small arrays (banks)

Divided into n
single-ported banks

*n = dispatch width

Multibanking 30
The Complete Picture

The Complete Picture 31


References
● Orinoco: Ordered Issue and Unordered Commit with
Non-Collapsible Queues Dibei Chen, Tairan Zhang, et.
al
● Prof. Biswa's slides

References 32
Questions?

Questions? 33

You might also like