Second-Generation Stack Computer Architecture

Charles Eric LaForest

Independent Studies
University of Waterloo
Canada

April 2007
Declaration
I hereby declare that I am the sole author of this research paper.
I authorize the University of Waterloo to lend this thesis to other institutions or individuals for
the purpose of scholarly research.
Signature:
I further authorize the University of Waterloo to reproduce this research paper by photocopying or other means, in total or in part, at the request of other institutions or individuals for the
purpose of scholarly research.
Signature:
The work in this research paper is based on research carried out in the Independent Studies
Program at the University of Waterloo, Canada. No part of this thesis has been submitted elsewhere for any other degree or qualification and is all my own work unless referenced to the
contrary in the text.
Copyright © 2007 by Charles Eric LaForest.
The copyright of this thesis rests with the author. Quotations and information derived from it
must be acknowledged.
Abstract
It is commonly held in current computer architecture literature that stack-based computers were
entirely superseded by the combination of pipelined, integrated microprocessors and improved
compilers. While correct, the literature omits a second, new generation of stack computers
that emerged at the same time. In this thesis, I develop historical, qualitative, and quantitative
distinctions between the first and second generations of stack computers. I present a rebuttal
of the main arguments against stack computers and show that they are not applicable to those
of the second generation. I also present an example of a small, modern stack computer and
compare it to the MIPS architecture. The results show that second-generation stack computers
have much better performance for deeply nested or recursive code, but are correspondingly
worse for iterative code. The results also show that even though the stack computer's zero-operand instruction format only moderately increases the code density, it significantly reduces
instruction memory bandwidth.
Acknowledgements
Firstly, thanks go to my family, immediate and extended, who have always given me the leeway
and support I needed, who always believed in me.
Sometime in 2000, Ralph Siemsen and Andrew E. Mileski introduced me to the Forth
programming language, which changed my view of programming. Soon after, I discovered the
microprocessors of Chen-Hanson Ting, Jeff Fox, and Charles H. (Chuck) Moore, which did
the same for my view of computer hardware. Aaron Holtzman suggested I play with FPGA
simulations of these computers, and graciously bore all my grumblings about broken Verilog
compilers. At the same time, I had stimulating email discussions with Myron Plichota and
Jecel Mattos de Assumpcao Jr. which led to some of the new ideas in this thesis.
It was Sheryl Cronk who eventually gave me the arguments and reasons to return to University. Many friends bought my old junk and helped me move. For this kick-start and support,
I am forever grateful.
Once at Waterloo, Professor Chrysanne DiMarco became my Adviser. Her thorough knowledge of the English language and of the customs of academia improved me greatly. Thus, I
must atone by myself for any linguistic errors in this thesis. Professors Giuseppe Tenti and
Barry Ferguson unrusted and expanded my mathematical skills. Professor Manoj Sachdev and
his PhD students, Shahab Ardalan and Bhaskar Chatterjee, took much time both inside and
outside of class to discuss the details of VLSI circuitry with me. Professor Mark Aagaard
helped me gain a broader perspective on computer architecture and led me to the class of Professor Paul Dasiewicz who taught me more about the subject. PhD candidate Brad Lushman
took time to help me with my exploration of programming languages. I also thank Professor
Anne Innis Dagg of Independent Studies, whose course on independent research rounded me
out well.
Outside of class, the denizens of the Computer Science Club provided both enthusiastic
discussions and gave me a chance to make my work heard. Mike Jeays provided me with a
useful and rare primary source for the KDF9, his favourite computer. Professor Steven M.
Nowick of Columbia University helped me understand his MINIMALIST synthesis tool.
The wheels of Independent Studies were kept turning by Professors Bill Abbott and Richard
Holmes and especially by Susan Gow, who provided endless enthusiasm, countless good examples, and sage advice.
The writing of this thesis was supervised by Dr. Andrew Morton here at Waterloo, and by
Professor J. Gregory Steffan of the University of Toronto. I am very grateful for their feedback
and guidance.
And finally, thanks to Joy, my fiancée. You brighten my life. You make me happy.
The years ahead with you glow with promise and adventure.
But the speed was power, and the speed was joy, and the speed was pure beauty.
Richard Bach, Jonathan Livingston Seagull
If my calculations are correct, when this baby hits eighty-eight miles per hour,
you're gonna see some serious shit.
Emmett "Doc" Brown, in Back To The Future
Contents

1 Introduction
   1.1 Research Goals
   1.2 Thesis Outline
      1.2.1 Part I: Historical Review
      1.2.2 Part II: Qualitative Arguments
      1.2.3 Part III: Quantitative Arguments

I Historical Review

2 History of the First Generation of Stack Computers
   2.1 Lukasiewicz and the First Generation
      2.1.1 Poland: Jan Lukasiewicz (1878-1956)
      2.1.2 Germany: Konrad Zuse (1910-1995)
      2.1.3 Germany: Friedrich Ludwig Bauer (1924-)
      2.1.4 Australia: Charles Leonard Hamblin (1922-1985)
      2.1.5 USA: Robert Stanley Barton
   2.2 The First Generation of Stack Computers
      2.2.1 Zuse Z4
      2.2.2 English Electric Co. KDF9
      2.2.3 Burroughs B5000 and later models
      2.2.4 International Computers Ltd. ICL2900 series
      2.2.5 Hewlett-Packard HP3000
   2.3 Shortcomings and Disappearance of the First Generation
      2.3.1 Explicit High-Level Language Support
      2.3.2 The Rise of RISC
      2.3.3 Excessive Memory Traffic
      2.3.4 The Need for Index Registers

II Qualitative Arguments

III Quantitative Arguments

Bibliography

List of Tables

List of Figures

List of Algorithms
Chapter 1
Introduction
I first learnt about stack computers in 2000 while working at a computer manufacturer where
co-workers introduced me to the Forth programming language, a stack-based programming
environment. Soon after, while looking for a suitable processor for a homebrew computer
system, I came across a mention of the MuP21 [MT95] in the Usenet Embedded Processor
and Microcontroller Primer and FAQ1:
The MuP21 was designed by Chuck Moore, the inventor of Forth. With the
MuP21, Forth can compile into machine code and still be Forth, because the machine code IS Forth. The MuP21 freaks out at 100 MIPS while consuming only 50
milliwatts. Not only that, the chip includes a video generator, has only about 7000
transistors (that's right, 7000 and not 7,000,000), and costs about $20.
The assembler on this chip is a sort of dialect of Forth, as the CPU is modeled
after the Forth virtual machine. MuP21 is a MINIMAL Forth engine. [. . . ] The
CPU programs the video generator and then just manipulates the video buffer. It
is composite video out, so it only needs one pin. MuP21 is only a 40 pin chip.
I'd never heard of anything like it. It was smaller and faster, and its machine code was a
structured language! I was hooked. Understanding this type of hardware and software became a hobby that ultimately led me to pursue a University degree on the topic. However, I
couldn't simply take a Computer Engineering degree since this kind of computer is virtually
non-existent in the mainstream literature and totally absent from the curriculum. Therefore, I
had to create one under the aegis of the Independent Studies (IS) program.
The IS program is a self-directed course of study guided and vetted by a Faculty Adviser
and composed of a combination of Independent Study Units and regular courses. After two
years of study (typically), a student petitions to enter Thesis Phase and if approved, spends a
year developing a thesis on a selected topic. A successfully completed thesis grants the degree
of Bachelor of Independent Studies (BIS). Overall, IS bears more resemblance to graduate
studies than undergraduate ones.
The structure of this thesis reflects the directions I have taken throughout the IS program.
I began with broad historical explorations of stack architecture and programming languages,
complemented by regular engineering courses on digital systems, computer architecture, and
integrated circuits. These efforts eventually concentrated on defining, simulating, programming, and partially implementing a particular stack computer design. In this thesis, I leave
aside the issues of programming language design and VLSI implementation to focus on the
architecture of the computer itself.
low-level analyses of how they execute iterative, recursive, tail-recursive, and nested subroutine
code. The issue of pipelining Gullwing is explored as a transformation of the DLX pipeline.
Gullwing is found to have a definite advantage at subroutine calls and memory bandwidth, but
is unfortunately architecturally equivalent to a DLX processor without load or branch delay
slots, with the same penalty to performance.
Chapter 8 addresses Gullwing's inefficient usage of memory for holding compiled code by
adding a third stack to temporarily hold instructions during subroutine calls. This new architectural feature will increase the density of code to the maximum possible value and accelerate
returns from subroutines.
Finally, Section 9.2 outlines the addition of stacks to a MIPS processor, without altering the
pipeline or instruction set, in order to give it the efficient subroutine call mechanism of a stack
computer. This section also introduces the addition of two forms of parallelism to Gullwing:
one which reduces its instruction count with compound stack operations, and the other which
reduces its CPI by overlapping the execution of instructions.
Appendix A provides the source to the Flight language kernel and the software used to
benchmark Gullwing. Appendix B contains the tabulated raw data from the analyses of the
dynamic and static properties of Gullwing machine code.
Part I
Historical Review
Chapter 2
History of the First Generation of Stack
Computers
I present here the first generation of stack computers in the context of the pioneers of the field
and of the machines that followed their insights. I discuss the organization and design goals of
these computers, their shortcomings, and ultimately their replacement by RISC designs.
/
/ +
/ + 5
/ + 5 5
/ 10
/ 10 2
5

Figure 2.1: Evaluation of Polish Notation expression / + 5 5 2

Figure 2.2: Fig. 1 from Bauer and Samelson German Patent #1094019
into a sequence of machine instructions. For example, the expression (5 + 5)/2 is expressed
unambiguously as 5 5 + 2 /, while the alternative interpretations (assuming no parentheses)
would be written as 5 2 / 5 + or 5 5 2 / +. Furthermore, only a single stack is required since the
operators are never waiting for their operands.
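To make the single-stack evaluation concrete, the following is a minimal C sketch (my own illustration, not taken from the thesis) that evaluates the postfix form 5 5 + 2 / with one stack; each operator finds both of its operands already on the stack, so no operator ever waits.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* The postfix form of (5 + 5)/2 discussed above. */
    const char *tokens[] = { "5", "5", "+", "2", "/" };
    int stack[16], sp = 0;                         /* single evaluation stack */

    for (size_t i = 0; i < sizeof tokens / sizeof *tokens; i++) {
        const char *t = tokens[i];
        if (strcmp(t, "+") == 0)      { stack[sp - 2] += stack[sp - 1]; sp--; }
        else if (strcmp(t, "/") == 0) { stack[sp - 2] /= stack[sp - 1]; sp--; }
        else                          { stack[sp++] = atoi(t); }  /* operand: push */
    }
    printf("%d\n", stack[0]);                      /* prints 5 */
    return 0;
}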
Hamblin expanded upon this insight in a pair of 1957 papers [Ham57a][Ham57b]1 (reprinted
[Ham85]). In summary:
It is now possible to visualize the general shape a machine designed to use such
code might take. It would have a running accumulator and nesting register as
described, and a number-store arranged on something like the pattern indicated,
[. . . ]
The running accumulator is a stack and is equivalent to Bauer's Number Cellar. The nesting
register is of the same structure but holds the return addresses of subroutines. This separation
of evaluation and flow-control into two stacks, which are also separate from main memory, is
the main architectural feature of second-generation stack computers (Figure 4.3).
Some employees of the English Electric Co. were present when Hamblin delivered his first
paper [All85][Dun77]. They integrated his ideas into their next computer, the KDF9.
2.2.1 Zuse Z4
The Z4 is too simple to fit into either the first or second generation of stack computers, not
being a stored-program machine, but it is the earliest one known and thus deserves mention.
Originally built in 1945, it was damaged during World War II, and later rebuilt in 1950. It
currently resides in the Deutsches Museum in Munich, Germany.
Like many of Zuses computers, the Z4 was designed to perform engineering calculations.
Its program was read from a punched plastic tape2 , and it included a second tape reader as a
form of subroutine call. Its ingenious mechanical main memory held 64 32-bit floating-point
numbers which could be loaded into a stack of 2 elements, operated upon, and then stored back.
It supported a full complement of basic arithmetic operations, including square root and some
built-in constants such as π. Its 8-bit instruction format was zero-operand, with one-operand
loads and stores for direct addressing of memory. No address calculations or access to the
stack pointer were possible. It supported conditional stop, skip, and call instructions. Figure
2.3 shows the programming model for the Z4. It is a simplified reproduction from Blaauw and
Brooks's book [BFPB97, fig. 10-29].
Figure 2.3: Programming model of the Zuse Z4 — instructions from punched tape, a memory with address and data paths, a two-level stack, and the ALU.
Ditzel and Patterson criticized the original arguments for High-Level Language Computer Systems (HLLCS) [DP80] (reprinted [DP86, FL86], and [DP98b] with updated
comments [DP98a]), and conclude that ". . . almost any system can be a HLLCS through
the appropriate software. . . ." They also wrote an overview of the arguments for reduced
instruction sets [PD80].
Patterson later wrote an extremely broad article on the features and successes of the early
RISC experiments, including software measurements and compiler techniques [Pat85]
(reprinted [Pat86, FL86]), and advocates taking implementation as a factor in computer
architecture.
At Berkeley, Patterson and Sequin headed the RISC I and RISC II projects as one approach to RISC designs [PS81] (reprinted [PS98a] with updated comments [PS98b]).
The fine details of their implementation were presented in the PhD thesis of one of their
students, Manolis Katevenis [Kat85].
One of the premises of RISC design is that the hardware and the software must be considered together. Hennessy and Jouppi measured the low-level features of software and
proposed some architectural guidelines to support them without the pitfalls of past high-level language support. These included ". . . the use of a load/store architecture and the absence of condition codes. . . ." [HJB+82]. These data guided the Stanford MIPS project
[HJP+ 82].
The major technological change of the time was the emergence of Very Large Scale Integrated (VLSI) circuits which made possible the implementation of an entire processor
on a single chip. The various approaches to integrated architecture are discussed by
Hennessy [Hen84] (reprinted [Hen86, FL86]).
The B7700 was likely the only first-generation stack computer to address this problem by including
a genuine 32-entry buffer (Figure 2.7) for the top of the stack [Bur73, pg. 3-36].
Character Mode processed 6-bit Binary-Coded Decimal numbers, while Word Mode processed binary 48-bit
numbers.
Chapter 3
History of the Second Generation of Stack
Computers
In this chapter, I present the second generation of stack computers in the context of the pioneers
of the field and of the machines that followed their insights. At the same time that the first
generation of stack computers was fading away in the light of RISC, the second generation
began to emerge and found a niche in embedded control systems instead of general-purpose
computing.
Burroughs B5500 computer was the influence for the use of a stack for expression evaluation
[Moo91]. The best introduction to Forth and the methodologies it favours is a pair of books
by Leo Brodie [BI86] [Bro84].
A Forth system is divided into two interpreters. The outer interpreter receives source input
and looks up each word in a dictionary. If found, it calls the inner interpreter to process the
words definition. In the most basic case a word definition is a series of addresses of other
words, themselves defined in the same manner, ending with primitives which are written in
machine code2 . The inner interpreter is a small virtual machine which walks through these
definitions and executes the primitives it encounters. The inner interpreter keeps track of the
nesting of these definitions on a stack in memory, commonly referred to as the Return Stack.
The Forth primitives do their operations on another such stack, the Data Stack, where they
take their arguments and place their results. The primitives are thus simple function applications which can be composed by executing them in sequence. Higher-level words are functional
compositions of these primitives. These new words interact with the stack and compose in the
same manner as the primitives.
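As an illustration of the indirect-threaded scheme just described, here is a minimal, self-contained C sketch (mine, not from the thesis; the word and primitive names are invented): word definitions are arrays of addresses of other words, the inner interpreter walks them, nesting is tracked on a Return Stack, and the primitives operate on a Data Stack.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef void (*code_t)(void);

/* A word is a code field (a C function) plus, for high-level definitions,
 * a body: a list of addresses of other words (indirect-threaded code). */
struct word { code_t code; struct word **body; };

static intptr_t     ds[32], *sp = ds;        /* Data Stack                */
static struct word **rs[32], ***rp = rs;     /* Return Stack (saved ip)   */
static struct word **ip;                     /* inner-interpreter pointer */
static struct word  *w;                      /* word currently executing  */

/* Primitives take their arguments from and leave results on the Data Stack. */
static void p_lit(void)  { *++sp = (intptr_t)*ip++; }   /* push in-line literal */
static void p_dup(void)  { sp[1] = sp[0]; ++sp; }
static void p_add(void)  { sp[-1] += sp[0]; --sp; }
static void p_dot(void)  { printf("%ld\n", (long)*sp--); }
static void p_exit(void) { ip = *--rp; }                 /* return from a word   */
static void p_bye(void)  { exit(0); }

/* docol: the code field of every high-level word -- nest into its body. */
static void docol(void)  { *rp++ = ip; ip = w->body; }

static struct word W_LIT  = { p_lit,  NULL };
static struct word W_DUP  = { p_dup,  NULL };
static struct word W_ADD  = { p_add,  NULL };
static struct word W_DOT  = { p_dot,  NULL };
static struct word W_EXIT = { p_exit, NULL };
static struct word W_BYE  = { p_bye,  NULL };

/* : DOUBLE ( n -- 2n ) DUP + ;  -- a definition made only of addresses of words */
static struct word *double_body[] = { &W_DUP, &W_ADD, &W_EXIT };
static struct word W_DOUBLE = { docol, double_body };

/* Top-level "program": LIT 21 DOUBLE . BYE */
static struct word *program[] =
    { &W_LIT, (struct word *)(intptr_t)21, &W_DOUBLE, &W_DOT, &W_BYE };

int main(void)
{
    ip = program;
    for (;;) {            /* the inner interpreter ("next")       */
        w = *ip++;        /* fetch the address of the next word   */
        w->code();        /* run its code field (primitive/docol) */
    }
}

A real Forth adds the dictionary and the outer interpreter on top of this loop; as noted below, a second-generation stack computer implements the equivalent of the loop, the two stacks, and the primitives directly in hardware.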
A second-generation stack computer is basically a physical realization of the inner interpreter, the Forth primitives, and the stacks. The primitives become the instruction set which
operates on a hardware Data Stack. The inner interpreter reduces to simple call and return
instructions which use a Return Stack to store the return addresses of subroutines.
This is known as indirect-threaded code. There exists also direct-threaded, string-threaded, token-threaded,
and subroutine-threaded versions, each with different size/speed trade-offs. Second-generation stack computers
are subroutine-threaded systems.
to/from memory as required. The implementation used about 9000 gates. Figure 3.2 shows a
block diagram of the main processor [Sha02, fig.1].
Unlike the NC4016 or the RTX-2000, the Sh-BOOM did not use an unencoded instruction format but packed four 8-bit, Forth-like instructions into each memory word. This formed
a simple instruction cache that allowed instructions to be executed while the next memory fetch
was in progress. This also allowed very small loops to execute from within the cache without
requiring an instruction fetch. Another interesting feature was the use of conditional SKIP
instructions which, if the condition was met, would skip over the remainder of the instructions
in the memory word. Conditional jumps and calls were implemented this way [GWS91].
The Sh-BOOM broke away from a pure stack architecture by including 16 general-purpose
registers (g0-g15), and by making most of the on-chip return stack addressable as registers (r0,
r1, etc.). The general-purpose registers were used for temporary storage and for I/O operations, much like the KDF9. To support the stack frames of ALGOL-like languages, instead of
simply pushing values on the return stack, a number of empty locations could be allocated in
one step and then later filled from the general-purpose registers.
The Sh-BOOM design is currently being marketed by Patriot Scientific4 as the IGNITE I
[Sha02] (previously PSC1000 [Sha99]) processor core, targeted at embedded Java applications.
It is the most sophisticated second-generation stack computer currently available.
Figure: (a) NC4016 and (b) RTX-2000 block diagrams.
http://www.ptsc.com/
3.2.4 MuP21
First published in 1995 [MT95], the MuP21 was the first of a family of chips dubbed Minimal
Instruction Set Computers (MISC). Like the Sh-BOOM, it used a packed instruction format
which held four 5-bit instructions in a 20-bit memory word. The internal stacks and ALU were
21 bits wide to support arithmetic carry. The data stack was only six cells deep, and the return
stack was a mere four cells deep. An address register (A) was added for temporary storage and
for memory accesses.
Like the Sh-BOOM, the MuP21 also contained a small auxiliary processor which had priority access to the memory bus. However, it was a video processor which executed its own
instruction set, also 5 bits wide, which was tailored for creating a 16-colour NTSC signal at an
output pin. A frame of video was described by a block of these instructions which could be
manipulated by the main processor.
Amazingly, the entire implementation used only 7000 transistors in a 1.2 µm process, had a
typical speed of 80 MIPS, and dissipated only 50 mW. There are hints that the design was
fully asynchronous. The MuP21 was an influential design. Its simplicity made it an ideal
choice for undergraduate and hobbyist projects, usually as FPGA implementations (for example: [HH00]).
3.2.5 F21
Jeff Fox had formed UltraTechnology in 1990 in Berkeley, to develop a custom computer in
collaboration with Chuck Moore. The result was the F21 microprocessor [Fox98], which was
an extension of the MuP21. The instruction set added post-incrementing loads and stores from
the address register and the top of the return stack, and some extra arithmetic operations. The
stacks were expanded to about 17 cells each. Like the MuP21, the memory interface was
20 bits wide and values were stored internally as 21 bits.
Like the MuP21, the F21 contained a video coprocessor, and added similar coprocessors
for analog I/O and serial networks. Some common routines were included in on-chip ROM.
In a 0.8 µm process, the F21 was implemented in about 15,000 transistors, and had a typical
execution rate of 100 MIPS (peaking internally at 500 MIPS), depending on memory access
time. UltraTechnology ceased to exist in 2002 with nothing formally published about the design
and only some prototype chips made. The website for the company5 contains some fairly
detailed documentation. For a reconstruction of what the block-level design might have been
like, see Figure 6.1.
3.2.6 c18
Around 2001, Moore took the F21 design in a new direction and produced the c18 [Moo01b,
Moo01a]. Architecturally, it is virtually identical to the F21, but adds a second address register
for on-chip communication. Its width was also reduced to 18 bits to match the fast cache
memory chips available at the time. This leaves room to pack only 3 instructions per memory
word.
The coprocessors were eliminated and replaced by a watchdog timer. There is no external
memory bus. External memory must be accessed via the parallel I/O pins, and programs must
be loaded into a few hundred words of on-chip RAM to be executed.
The c18 was simulated in a modern 1.8 V, 0.18 µm process. It had a predicted sustained
execution rate of 2400 MIPS, while dissipating about 20 mW. It was an aggressive, fully asynchronous design.
The c18 was targeted at multiprocessing. A 5x5 array of c18s, connected by horizontal
and vertical buses, would fit in 7 mm². This eventually was realized as the SEAforth-24
multiprocessor currently entering production at Intellasys (Intelligent Array Systems).
7. The only documentation was at http://www.stringtuner.com/myron.plichota/steamer1.htm, which is now defunct but archived in the Internet Archive Wayback Machine at http://www.archive.org/web/web.php
8. http://www.jwdt.com/~paysan/4stack.html
9. http://www.jwdt.com/~paysan/b16.html
10. http://www.comp.nus.edu.sg/~yuenck/
11. One of two papers available at http://www.comp.nus.edu.sg/~yuenck/stack
12. http://www.eforth.com.tw/
13. http://www.eforth.com.tw/academy-n/Chips.htm
14. Publication list: http://www.eforth.com.tw/academy-n/Bookstore/bookstore_4.htm
15. Published by Offete Enterprises: http://www.ultratechnology.com/offete.html
Part II
Qualitative Arguments
Chapter 4
Distinguishing the First and Second
Generations
The existing computer architecture literature considers all stack computers to be of the same
kind1. This view seems correct when contrasted against modern register-based machines. However, it conflates the first and second generations of stack computers, which makes difficult a
clear discussion of their respective properties. Distinguishing the generations is important since
the second resembles current register-based designs much more than it does the first. Without
this distinction, an entire branch of computer architecture is missed solely due to a categorical
error.
The differences between the generations stem from the two historical approaches to stack
computation. The first generation is based on Bauer's stack principle for subroutine storage
allocation (Section 2.1.3) and is exemplified by the Burroughs B5000 series (Section 2.2.3).
The second generation originates in Hamblin's method for evaluating and composing expressions (Section 2.1.4), first seen in the English Electric KDF9 (Section 2.2.2), and later independently rediscovered as the Forth programming language (Section 3.1.1.1).
The only significant exception to this conflation I've found is a section in Feldman and
Retter's text [FR93, pp.599-604] which lists some of the same first generation stack machines
as I do in Section 2.2, then proceeds to explain in a nutshell the origin and features of the second
generation of stack computers, although they do not refer to them by that name.
In this chapter, I expand on Feldman and Retter's statements and propose a codification
based on some properties of the stacks: their location, purpose, and the operations done upon
them.
Except for a passing mention in the preface of Koopman's book [Koo89] and a short summary in Bailey's
PhD thesis [Bai96].
Figure 4.1: First-generation organization — the Return Value, Parameters, Return Address, and Locals of nested subroutine calls A and B are kept as frames on a single stack in main memory, addressed by a Stack Pointer in the CPU.

Figure 4.2: Register-based organization — the same frames are kept in memory and manipulated through the CPU registers.

Figure 4.3: Second-generation organization — return addresses are kept on a separate Return Stack inside the CPU.
Chapter 5
Objections Cited by Hennessy & Patterson
Hennessy and Patterson's magnum opus Computer Architecture: A Quantitative Approach
[PH90, HP96, HP02] has a tremendous influence on the field of computer architecture. I support this claim by comparing the number of citations it has received with those of other textbooks on computer architecture.
Most arguments against stack computer architecture are drawn from a few statements found
in this book. I present counterarguments which support the claim that the statements are valid
for the first generation of stack computers, but not the second.
http://portal.acm.org/
http://citeseer.ist.psu.edu/
Incidentally, the main text on second-generation stack computers [Koo89] had, according
to the ACM Guide, seven citations in the Fall of 2004 and twelve as of January 2006. It is not
listed in CiteSeer.
Book     Citations      Book     Citations
[PH90]   324            [FR93]   1
[HP96]   310            [MK97]   1
[HP02]   19             [HVZ95]  0
[PH98]   21             [Sta02]  0
[Hwa92]  50             [GL03]   0
[Kog90]  24             [Omo99]  0
[Omo94]  6              [Bur98]  0
[SSK97]  6              [Sto92]  0
[Sta93]  2              [Wil91]  0
[Wil01]  1              [Sta90]  0
[Hay97]  1              [Mur90]  0
[Wil96]  1              [Bla90]  0
[MP95]   1

Table 5.1: Comparison of Citations of Computer Architecture Texts (as of Fall 2004)
The arguments in this section suggest an interesting way of thinking qualitatively about stacks: that they
are the reciprocal (or the inverse) of registers. For example, reading a register does not destroy its contents,
but writing does. Conversely, reading from a stack pops information from it, but writing to it simply pushes the
existing information down. Up to its maximum capacity, no information is lost. This is a tantalizing symmetry.
4. http://www.ece.cmu.edu/~koopman/stack_compiler/index.html
5. http://www.complang.tuwien.ac.at/papers/maierhofer%26ertl97.ps.gz
6. http://www.complang.tuwien.ac.at/papers/maierhofer%26ertl98.ps.gz
entering or exiting a subroutine (Section 7.3.6). This overhead is virtually nil in stack computers. Finally, there are hints that a redesigned stack computer instruction set could combine
stack manipulations with arithmetic operations for free, without adding any new datapath or
control lines (Sections 9.2.2 and 9.2.3).
2. The limitations are partially a matter of compiler technology (Section 5.3) and on the
other hand, stacks avoid the subroutine call overhead of register-based computers (Section 7.3.6).
3. This is a straw-man: All stacks have a bottom. Past experiments have shown that an
on-chip stack buffer that is 16 to 32 elements deep eliminates virtually all in-memory
stack accesses for both expression evaluation and subroutine nesting and that the number
of accesses to memory decreases exponentially for a linear increase in hardware stack
depth [Koo89, 6.4] [Bai96, 4.2.1] (Appendices B.2.6 and B.2.7).
Current Intel processors use exactly such a stack, 16 elements deep, to cache return
addresses on-chip [Int06, 2.1.2.1, 3.4.1.4], as does the Alpha AXP 21064 processor
[McL93]. Hennessy & Patterson themselves show data supporting this feature [HP96,
pg. 277] [HP02, pg. 214]. The SPARC architecture accomplishes a similar result with
its register windows.
The preceding quote is however an abridged version. The original statements by Bell et al.
were:
The System/360 designers also claim that a stack organized machine such as the
English Electric KDF 9 (Allmark and Lucking, 1962) or the Burroughs B5000
(Lonergan and King, 1961) has the following disadvantages:
1. Performance is derived from fast registers, not the way they are used.
2. Stack organization is too limiting, and requires many copy and swap operations.
3. The overall storage of general register and stack machines are the same, considering point #2.
4. The stack has a bottom, and when placed in slower memory there is a performance loss.
5. Subroutine transparency is not easily realized with one stack.
6. Variable length data is awkward with a stack.
We generally concur with points 1, 2, and 4. Point 5 is an erroneous conclusion,
and point 6 is irrelevant (that is, general register machines have the same problem).
[BCM+ 70]
Hennessy & Patterson are simply repeating the points which are supported by the authors of
this quote. In retrospect, it is peculiar that the authors lump both the KDF9 and the B5000
together, despite being very different machines (see Sections 2.2.2, 2.2.3, and Chapter 4). The
arguments should not apply equally to both. It turns out that the quote is an extremely abridged
version of the original statements by the System/360 designers:
Serious consideration was given to a design based on a pushdown accumulator
or stack. This plan was abandoned in favor of several registers, each explicitly
addressed. Since the advantages of the pushdown organization are discussed in
the literature, it suffices here to enumerate the disadvantages which prompted the
decision to use an addressed-register organization:
41
1. The performance advantage of a pushdown stack organization is derived principally from the presence of several fast registers, not from the way they are
used or specified.
2. The fraction of surfacings of data in the stack which are profitable, i.e.,
what was needed next, is about one-half in general use, because of the occurrence of repeated operands (both constants and common factors). This
suggests the use of operations such as TOP and SWAP, which respectively
copy submerged data to the active positions and assist in clearing submerged
data when the information is no longer required.
3. With TOPs and SWAPs counted, the substantial instruction density gained
by the widespread use of implicit addresses is about equalled by that of the
same instructions with explicit, but truncated, addresses which specify only
the fast registers.
4. In any practical implementation, the depth of the stack has a limit. The register housekeeping eliminated by the pushdown organization reappears as
management of a finite-depth stack and as specification of locations of submerged data for TOPs and SWAPs. Further, when part of a full stack must
be dumped to make room for new data, it is the bottom part, not the active
part, which should be dumped.
5. Subroutine transparency, i.e., the ability to use a subroutine recursively, is one
of the apparent advantages of the stack. However, the disadvantage is that the
transparency does not materialize unless additional independent stacks are
introduced for addressing purposes.
6. Fitting variable-length fields into a fixed-width stack is awkward.
In the final analysis, the stack organisation would have been about break-even for
a system intended principally for scientific computing. Here the general-purpose
objective weighed heavily in favor of the more flexible addressed-register organization. [ABB64]
I'll address these points individually:
1. As previously mentioned, on-chip stacks are really a linear collection of registers. Additionally, the unabridged statement supports the use of stacks when either fully on-chip,
as in second-generation stack computers (which the KDF9 was), or sufficiently buffered
as in the B7700 (Sections 2.2.3 and 2.3.3) which was not introduced until 1973.
2. While stack permutations are inevitable, how often they are required is a strong function
of how the code is arranged by the programmer or compiler. It also depends on whether
the code is primarily iterative, recursive, or composed of nested procedures (Section
7.3.1). The choice of stack instruction set affects the overhead significantly (Section
9.2.2).
3. The static code analyses for a basic second-generation stack computer (Appendix B.1.5)
support this statement, but also suggest that the instruction density can be increased much
further (Chapter 8). This is also dependent on the nature and arrangement of the code.
42
4. Experiments done since show that a stack needs to be only 16 to 32 elements deep to
virtually eliminate stack overflow to main memory [Koo89, 6.4.1] [Bai96, 4.2]. What
memory traffic remains can be mitigated by the hardware stack management algorithms
that have been discovered since [Koo89, 6.4.2] [Bai96, 6.2]. The abridged versions
of this statement omitted the trade-off between the register housekeeping required for
subroutine calls on register-based computers versus the stack housekeeping required for
iterative code on stack-based computers.
5. I suspect the authors were thinking of the KDF9 since first-generation stack computers exhibit subroutine transparency using a single stack. This is possible because the
activation record of an instance of a subroutine is stored in main memory. For second-generation stack computers like the KDF9, whose stack is not randomly accessible, a
single stack is insufficient since at the very least the arguments being passed recursively
would be buried under the return address (Section 4.2). Thus at the minimum a second
stack is needed to hold return addresses.
The statement is correct for both generations when index registers are used (Sections
2.3.4 and 3.4.1) since unless they are also stacks themselves, nested subroutines which
use them would have to do housekeeping to preserve and access their values, effectively
implementing in software the stack of a first-generation stack computer at a great performance penalty. Finally, given integrated circuits, the cost of an extra stack is minimal.
6. Bell et al. are correct when saying that this point is irrelevant. The problem, if present,
is orthogonal to the internal organization of a fixed-width machine.
In summary, the arguments quoted by Hennessy & Patterson were oversimplified and referred
to old hardware and compiler technology. In fact, the original source of the arguments is far
less critical of stacks than the version Bell et al. present and most of the shortcomings have
been overcome since then.
Part III
Quantitative Arguments
Chapter 6
A Stack-Based Counterpart to DLX:
Gullwing
I present the Gullwing1 processor architecture as an unpipelined, stack-based analogue of Hennessy and Patterson's DLX [HP02]: a small design that is a simplified reflection of the current
state of the art. Gullwing is closely based on the available descriptions of the MuP21, F21, and
c18 processors (Chapter 3).
Figure 6.1: Gullwing block diagram — main memory (MEM), the Instruction Shift Register (ISR) holding six 5-bit instruction slots (S0-S5) with the Last flag (L) and current instruction (INST), the ALU with its TOP and A registers, the Data Stack (DS), the Return register (R) and Return Stack (RS), the Program Counter (PC) with its incrementer, and the Memory Address Register (MAR).
The name Gullwing was inspired by Richard Bach's book, "Jonathan Livingston Seagull", and also by the
DeLorean DMC-12 made famous in the "Back To The Future" movie trilogy by Robert Zemeckis and Bob Gale.
2. This is not the instruction fetch bottleneck it seems to be. See Section 6.3.1 for why a second bus would remain idle over 90% of the time.
3. These two bits could be used as a seventh instruction taken from the set of instructions whose three MSBs are always zero, or as subroutine return and call flags as in the NC4016 and RTX-2000. They are left unused for simplicity.
of the instruction registers are shifted over by one instruction and filled in at the far end with a
Program Counter Fetch (PC@) instruction (see Section 6.2). After the last loaded instruction
is executed, the PC@ instruction fetches the next instruction group into the ISR.
If the instruction after INST is a PC@, then INST is the last one of this group. This is
signalled by the Last line (L) being high during the execution of INST. If the encoding of PC@
is conveniently chosen to be all zeroes, then L is simply the NOR of all the bits of S1. This
line is used to overlap the execution of INST and PC@ whenever possible (Section 6.3.1).
respectively take theirs from the Program Counter (PC) and the Return Register (R). They all
execute in two cycles, except for PC@ which requires only one.
Mnemonic  Name        Operation
PC@       PC Fetch    Fetch next group of instructions into ISR
JMP       Jump        Unconditional Jump
JMP0      Jump Zero   Jump if top of data stack is zero. Pop stack.
JMP+      Jump Plus   Jump if top of data stack is positive. Pop stack.
CALL      Call        Call subroutine
RET       Return      Return from subroutine

Table 6.1: Gullwing Flow Control Instructions
Mnemonic  Name           Operation
LIT       Fetch Literal  Push in-line literal onto data stack
@A+       Load A Plus    Push MEM[A++] onto data stack
@R+       Load R Plus    Push MEM[R++] onto data stack
@A        Load A         Push MEM[A] onto data stack
!A+       Store A Plus   Pop data stack into MEM[A++]
!R+       Store R Plus   Pop data stack into MEM[R++]
!A        Store A        Pop data stack into MEM[A]

Table 6.2: Gullwing Load and Store Instructions
Binary operations are destructive: they pop two values from the DS and push one result back. Unary operations
simply replace the contents of the top of the DS. All these instructions execute in a single cycle.
The +* operation is unusual in that it is conditional and non-destructive. It replaces the
contents of TOP with the sum of TOP and of the first element of DS, only if the original value
of TOP was odd (least significant bit is set). Combined with shift operations, it provides a
means of synthesizing a multiplication operation.
Mnemonic  Name        Operation
NOT       Not         Bitwise complement of top of data stack
AND       And         Bitwise AND of top two elements of data stack
XOR       Xor         Bitwise XOR of top two elements of data stack
+         Plus        Sum of top two elements of data stack
2*        Two Star    Logical shift left of top of data stack
2/        Two Slash   Arithmetic shift right of top of data stack
+*        Plus Star   Multiply step

Table 6.3: Gullwing ALU Instructions
shifting out the multiplier and leaving the completed 8-bit product on the top of the stack. The
multiplicand is then discarded.
This method produces one bit of product approximately every two cycles and is a compromise between full hardware and software implementations. Its disadvantage is that it is limited
to single-word products, and thus to half-word factors. However, it can also be optimized when
the multiplier is much smaller than the multiplicand.
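The following C sketch (my own, not from the thesis) shows the shift-and-add sequence that a loop of +* and 2/ instructions performs for 8-bit factors. The placement of the multiplier in TOP and the pre-shifted multiplicand in the next stack cell is an assumption consistent with the description above, and an unsigned shift stands in for 2/ since the values involved here are non-negative.

#include <stdint.h>
#include <stdio.h>

/* One "+* 2/" pair per loop iteration; after 8 iterations the 16-bit product
 * is left in TOP and the shifted multiplicand (NEXT) would be DROPped. */
static uint32_t multiply8x8(uint8_t multiplier, uint8_t multiplicand)
{
    uint32_t top  = multiplier;                   /* TOP register            */
    uint32_t next = (uint32_t)multiplicand << 8;  /* second Data Stack cell  */

    for (int i = 0; i < 8; i++) {
        if (top & 1)        /* +* : add NEXT to TOP only if TOP is odd       */
            top += next;
        top >>= 1;          /* 2/ : shift TOP right by one bit               */
    }
    return top;
}

int main(void)
{
    printf("%u\n", (unsigned)multiply8x8(13, 11));   /* prints 143 */
    return 0;
}

Each iteration retires one multiplier bit out of the low end of TOP while accumulating the partial product in the high end, which is why the method produces roughly one bit of product every two cycles.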
The multiplication of full-word factors, with a double-word product, could be accomplished
in the same way by shifting across two registers. For example, TOP could hold the multiplicand
and A the multiplier. By having +* test the LSB of A, and 2/ shifting from TOP into A, the
process would leave the lower half of the product in A and the upper half in TOP. Executing
A> would then place both halves on the Data Stack.
Mnemonic  Name     Operation
A>        A From   Push A onto data stack
>A        To A     Pop data stack into A
DUP       Dup      Duplicate top of data stack
DROP      Drop     Pop data stack
OVER      Over     Push copy of second element onto data stack
>R        To R     Pop data stack and push onto return stack
R>        R From   Pop return stack and push onto data stack

Table 6.4: Gullwing Stack Manipulation Instructions
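For illustration only (a hand-written sequence, not one taken from the thesis benchmarks), the expression (5 + 5)/2 used in Chapter 2 could be computed with the instructions above as: LIT 5, DUP, +, 2/. The literal is pushed, duplicated, the two copies are summed, and the arithmetic right shift halves the result, leaving 5 on the Data Stack after four zero-operand instructions plus one in-line literal.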
Figure: Example of packed instruction groups in memory — each memory word holds up to six instruction slots (S0-S5); words containing JMP0, LIT, or CALL are followed by in-line argument words (the jump address, the literal value, the call address), and unused trailing slots are filled with PC@.
Outputs
N  Control
0  PC(+1)→PC,MAR, MEM→ISR
1  MEM→PC,MAR
0  PC@
1  DS→TOP, DS(POP), MEM→PC,MAR
0  DS→TOP, DS(POP), PC(+1)→PC,MAR, ISR<<
0  PC@
1  DS→TOP, DS(POP), MEM→PC,MAR
0  DS→TOP, DS(POP), PC(+1)→PC,MAR, ISR<<
0  PC@
1  PC(+1)→R, R→RS, RS(PUSH), MEM→MAR,PC
0  PC@
1  RS(POP), RS→R, R→PC,MAR
0  PC@
Inputs              Outputs
INST TOP S          N  Control
NOP  X   0          0  ISR<<
UND0 X   0          0  ISR<<
UND1 X   0          0  ISR<<
UND2 X   0          0  ISR<<
UND3 X   0          0  ISR<<
Outputs
N  Control
0  TOP→ALU(NOT)→TOP, ISR<<
0  TOP,DS→ALU(AND)→TOP, DS(POP), ISR<<
0  TOP,DS→ALU(XOR)→TOP, DS(POP), ISR<<
0  TOP,DS→ALU(+)→TOP, DS(POP), ISR<<
0  TOP→ALU(2*)→TOP, ISR<<
0  TOP→ALU(2/)→TOP, ISR<<
0  ISR<<
0  DS,TOP→ALU(+)→TOP, ISR<<
Outputs
N  Control
0  MEM→TOP, TOP→DS, DS(PUSH), PC(+1)→PC,MAR, ISR<<
1  A→MAR, (+1)→A
0  MEM→TOP, TOP→DS, DS(PUSH), PC→MAR, ISR<<
1  R→MAR, (+1)→R
0  MEM→TOP, TOP→DS, DS(PUSH), PC→MAR, ISR<<
1  A→MAR
0  MEM→TOP, TOP→DS, DS(PUSH), PC→MAR, ISR<<
1  A→MAR, (+1)→A
0  DS(POP), DS→TOP, TOP→MEM, PC→MAR, ISR<<
1  R→MAR, (+1)→R
0  DS(POP), DS→TOP, TOP→MEM, PC→MAR, ISR<<
1  A→MAR
0  DS(POP), DS→TOP, TOP→MEM, PC→MAR, ISR<<
Outputs
N  Control
0  TOP→DS, DS(PUSH), ISR<<
0  DS→TOP, DS(POP), ISR<<
0  DS→TOP, TOP→DS, DS(PUSH), ISR<<
0  RS(POP), RS→R, R→TOP, TOP→DS, DS(PUSH), ISR<<
0  DS(POP), DS→TOP, TOP→R, R→RS, RS(PUSH), ISR<<
0  A→TOP, TOP→DS, DS(PUSH), ISR<<
0  TOP→A, DS→TOP, DS(POP), ISR<<
Using an instruction (PC@) to fetch the next group of instructions eliminates the need for
dedicated instruction-fetching hardware. The fact that it is also shifted into the ISR for free is
also very elegant. However, this means that at least one out of every seven instructions executed
will be a PC@, or about 14%5 . This overhead could be reduced by overlapping fetching with
execution while the memory bus is free.
A priori, it is unclear how much benefit would come from overlapping the instruction fetch.
Koopman provides a table of the dynamic instruction execution frequencies of Forth primitives
averaged over a set of benchmarks [Koo89, 6.3]. These primitives map almost directly to the
Gullwing instruction set and the benchmarks from Section 7.1 are also written in a Forth-like
language. The data predicts that approximately half of the executed primitives are either flow
control or load/store instructions and thus access memory. Assuming that this probability is
evenly distributed, the fetching of instructions could be overlapped half the time, reducing the
overhead to about 7% for straight-line code.
Accomplishing this overlap depends on this fact: if the last instruction before a PC@
does not access memory, then both instructions can be executed simultaneously without conflict. Figure 6.2 shows how the ISR makes this possible. If the next instruction to be executed
is a PC@6 , then the current instruction being executed is the last one of this group. This is
signalled by raising the Last flag (L). If the current instruction does not access memory7 , then
instead of shifting the ISR at the end of its execution, the instruction will fetch the next group
of instructions from memory in the same manner as PC@, as if it had executed concurrently.
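The following C sketch (mine, not from the thesis) walks one packed instruction group and applies the overlap rule just described. Only the all-zero encoding of PC@ is taken from the text; the example opcode values and the convention that a clear MSB marks a memory-accessing instruction are illustrative assumptions (the text only suggests that a single bit could serve as this test).

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define SLOTS     6
#define SLOT_BITS 5
#define PC_FETCH  0x00u                  /* PC@ is encoded as all zeroes      */

/* Assumed convention: opcodes with the MSB clear access memory
 * (flow control, loads, stores); the rest (ALU, stack ops) do not. */
static bool accesses_memory(unsigned op) { return (op & 0x10u) == 0; }

int main(void)
{
    /* One packed group, slot S0 in the least significant bits.
     * The opcode values are placeholders: two memory ops, one non-memory op. */
    unsigned ops[SLOTS] = { 0x12, 0x03, 0x15, 0x00, 0x00, 0x00 };
    uint32_t isr = 0;
    for (int i = SLOTS - 1; i >= 0; i--)
        isr = (isr << SLOT_BITS) | ops[i];

    for (int slot = 0; slot < SLOTS; slot++) {
        unsigned inst = isr & 0x1Fu;                 /* current instruction   */
        unsigned next = (isr >> SLOT_BITS) & 0x1Fu;  /* next slot (S1)        */
        bool     last = (next == PC_FETCH);          /* the L flag            */

        if (inst == PC_FETCH) {
            printf("slot %d: PC@, fetch next group\n", slot);
            break;
        }
        if (last && !accesses_memory(inst)) {
            printf("slot %d: opcode %02x overlaps the next group fetch\n", slot, inst);
            break;                                   /* fetch done this cycle */
        }
        printf("slot %d: opcode %02x, ISR<<\n", slot, inst);
        isr >>= SLOT_BITS;   /* shifting refills the far end with zeroes (PC@) */
    }
    return 0;
}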
Implementing this optimization requires adding the L bit as an input to the state machine.
Flow control and load/store instructions ignore this bit since they always access memory. The
remaining instructions now have two versions, selected by L, which either shift or load the ISR.
Algorithms 8, 9, and 10 show the necessary changes.
Analysis of code executed with an overlapping instruction fetch (Appendix B.2.2) confirms
that the actual overhead of explicit PC@ instructions is reduced to 4.1 to 2.2% of the total
executed instructions. It also shows that 4.2 to 7.5% of instructions are executed concurrently
(FOLDS) with a PC@. This demonstrates a 50.6 to 77.3% reduction in instruction fetch
overhead.
5. The actual overhead will be lower since other flow control instructions do their own instruction loading. Without instruction fetch overlapping, the actual overhead of instruction fetching is 8.3 to 9.7%.
6. PC@ is conveniently encoded as all zeroes.
7. Since the opcodes are divided about equally between memory and non-memory instructions, a single bit (the MSB, for example) can be used to test whether the instruction accesses memory. This should simplify the implementation of the state machine.
Outputs
N  Control
0  TOP→ALU(NOT)→TOP, ISR<<
0  TOP→ALU(NOT)→TOP, PC@
0  TOP,DS→ALU(AND)→TOP, DS(POP), ISR<<
0  TOP,DS→ALU(AND)→TOP, DS(POP), PC@
0  TOP,DS→ALU(XOR)→TOP, DS(POP), ISR<<
0  TOP,DS→ALU(XOR)→TOP, DS(POP), PC@
0  TOP,DS→ALU(+)→TOP, DS(POP), ISR<<
0  TOP,DS→ALU(+)→TOP, DS(POP), PC@
0  TOP→ALU(2*)→TOP, ISR<<
0  TOP→ALU(2*)→TOP, PC@
0  TOP→ALU(2/)→TOP, ISR<<
0  TOP→ALU(2/)→TOP, PC@
0  ISR<<
0  PC@
0  DS,TOP→ALU(+)→TOP, ISR<<
0  DS,TOP→ALU(+)→TOP, PC@
Outputs
N  Control
0  TOP→DS, DS(PUSH), ISR<<
0  TOP→DS, DS(PUSH), PC@
0  DS→TOP, DS(POP), ISR<<
0  DS→TOP, DS(POP), PC@
0  DS→TOP, TOP→DS, DS(PUSH), ISR<<
0  DS→TOP, TOP→DS, DS(PUSH), PC@
0  RS(POP), RS→R, R→TOP, TOP→DS, DS(PUSH), ISR<<
0  RS(POP), RS→R, R→TOP, TOP→DS, DS(PUSH), PC@
0  DS(POP), DS→TOP, TOP→R, R→RS, RS(PUSH), ISR<<
0  DS(POP), DS→TOP, TOP→R, R→RS, RS(PUSH), PC@
0  A→TOP, TOP→DS, DS(PUSH), ISR<<
0  A→TOP, TOP→DS, DS(PUSH), PC@
0  TOP→A, DS→TOP, DS(POP), ISR<<
0  TOP→A, DS→TOP, DS(POP), PC@
Algorithm 10 Gullwing No-Op and Undefined Instructions with Instruction Fetch Overlap
Inputs              Outputs
INST TOP L S        N  Control
NOP  X   0 0        0  ISR<<
NOP  X   1 0        0  PC@
UND0 X   0 0        0  ISR<<
UND0 X   1 0        0  PC@
UND1 X   0 0        0  ISR<<
UND1 X   1 0        0  PC@
UND2 X   0 0        0  ISR<<
UND2 X   1 0        0  PC@
UND3 X   0 0        0  ISR<<
UND3 X   1 0        0  PC@
Chapter 7
Comparisons With DLX/MIPS
As stated in Section 1.1, this thesis aims to divide the family of stack-based computers into first
and second generations. Part of this distinction consists of showing that the second generation
resembles the register-based machines which replaced the first.
This chapter supports this argument by comparing the Gullwing processor from Chapter 6
with the well-known MIPS and DLX processors used as demonstrators by Hennessy & Patterson. This comparison is based on statistics derived from the organization and execution of
programs and further based on a comparison of the pipeline structure of each microprocessor.
The characteristics compared include: cycle time, cycle count, cycles per instruction, instruction count, dynamic instruction mix, memory accesses per cycle, code size, and code density.
is executed as it is received and can contain the name of any built-in or previously defined
function. The source to the kernel is listed in Appendix A.1.
is defined using the functions of the original kernel and all its extensions. The result is a higher-level description of the Flight kernel, written in itself. The new kernel binary residing in the
virtual machine memory area is identical to the original4 . The compilation of the self-hosted
kernel requires 246 lines of code containing 681 names. This process exercises the functions
of the kernel and its extensions via the indirection of the metacompiler and executes 6,511,976
instructions. The source is listed in Appendix A.3.3.
Flight Language Kernel Extensions Now that a Flight kernel resides in the virtual machine
memory, it is possible to start the virtual machine and execute this new kernel. The Flight
language extensions from Section 7.1.2 are fed to the new kernel, exercising the same code,
but through the emulation layer of the virtual machine. This executes 184,993,301 instructions,
taking about 90% of the total execution time of the VM test suite.
4. Actually, there is a gap of one memory word between functions due to a quirk of the compilation process, but the actual code is the same.
5. From SPECint2000 data: Table 7.1: 2.8% XOR, plus 0.3% other logical ops. Table 7.2: 2.1% XOR, plus 0.4% other logical ops.
Instructions which do not exist in a given machine are represented by a dash (-).
For both the compiler and interpreter test data, the dynamic instruction mix of the Gullwing
processor differs from that of DLX and MIPS in some significant ways:
The proportion of loads and stores is lower. I believe this originates from a lack of
complex data structures in the Gullwing software, and the need for DLX and MIPS to
use a stack in main memory to pass parameters to subroutines.
There are many more immediate loads (fetches) since Gullwing cannot include small
literal operands within most instructions as DLX and MIPS do.
The proportion of calls and returns is much greater. The stack-based architecture of
Gullwing results in extremely efficient procedure linkage, which makes the inlining of
code unnecessary in most cases.
About a quarter of all the instructions executed are stack manipulation instructions. Some
are manipulations made explicit by the absence of operands in Gullwing instructions as
a consequence of the lack of random access to a stack. DLX and MIPS include implicit
move, copy, and delete operations within their instructions by using a three-operand
instruction format which gives random access to their registers. A large portion of these
stack manipulations are moves of addresses from the top of the Data Stack to the Address
Register, as performed by the >A instruction. Section 9.2.3 discusses a means to reduce
the overhead of these moves.
Additionally, the interpreter dynamic instruction mix (Table 7.2) has some further differences:
The proportion of conditional jumps is lower on Gullwing. This is likely because of
the lack of conditionals in the main VM loop, which uses instead a table of function
addresses.
The large incidence of shifts is due to the use of the shift-left (2/) instruction in the VM
to extract individual instructions out of a memory word.
Benchmark
DLX/MIPS Instr.
load
store
add
sub
mul
div
compare
load imm
cond branch
cond move
jump
call
return, jmp ind
shift
and
or
other (xor, not)
other (moves)
1.3%
1.1%
1.5%
6.2%
1.6%
4.2%
0.5%
-
0.6%
0.7%
0.6%
0.6%
1.1%
4.6%
8.5%
2.5%
-
2.1%
6.4%
6.4%
0.2%
0%
6.5%
29.2%
other
0%
Gullwing Instr.
@A, @A+, @R+
!A, !A+, !R+
+
LIT, PC@
JMP0, JMP+
(incl. TAKEN jumps)
JMP
CALL
RET
2/, 2*
AND
XOR, NOT
DUP, DROP, OVER,
>R, R>, >A, A>
NOP, +*
Benchmark
DLX/MIPS Instr.
load
store
add
sub
mul
div
compare
load imm
cond branch
cond move
jump
call
return, jmp ind
shift
and
or
other (xor, not)
other (moves)
1.8%
3.1%
3.5%
0.7%
2.1%
6.2%
0.1%
-
1.9%
1.7%
1.1%
1.1%
0.5%
1.2%
8.7%
3.1%
-
2.7%
5.0%
7.5%
12.6%*
3.4%
3.9%
23.3%
other
0%
Gullwing Instr.
@A, @A+, @R+
!A, !A+, !R+
+
LIT, PC@
JMP0, JMP+
(incl. TAKEN jumps)
JMP
CALL
RET
2/, 2*
AND
XOR, NOT
DUP, DROP, OVER,
>R, R>, >A, A>
NOP, +*
The higher CPI of Gullwing does not compare favourably with that of DLX. However,
loads and taken jumps on Gullwing take two cycles, which is as if a load or branch stall always
occurred. Therefore, Gullwing is really architecturally equivalent to a DLX without load delay
slots and without delayed branches, which always stalls on loads and taken branches. For
example, if 100% of loads are assumed to stall on DLX, their load penalty increases to 29.6%
for GCC and 33.7% for Lisp, which raises the total CPI of DLX to 1.34 and 1.41 respectively,
which is comparable to Gullwing.
Correspondingly, there are some possible optimizations to Gullwing which would reduce
the CPI of loads and jumps and make Gullwing equivalent to a normal DLX (Section 9.2.3).
Test              GCC      Lisp
Total Loads       29.6%    33.7%
Load Stalls       23%      24%
Load Penalty      6.81%    8.10%
Branch Penalty    4%       7%
Total Penalty     10.8%    15.1%
Overall CPI       1.11     1.15
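One way to read the 1.34 and 1.41 figures quoted above: with every load assumed to stall, the load penalty grows to the full load fraction, so the DLX CPI becomes
    CPI = 1 + total loads + branch penalty
    GCC:  1 + 0.296 + 0.04 ≈ 1.34
    Lisp: 1 + 0.337 + 0.07 ≈ 1.41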
                      Fraction of Total CPI
Instr. Type          Extensions        VM
Conditionals            0.071        0.031
Subroutine              0.300        0.304
Fetches                 0.168        0.180
Load/Store              0.312        0.276
ALU                     0.169        0.277
Stack                   0.292        0.233
Total                   1.312        1.301
data fetches/loads and thus only one Memory Address Register (MAR) (Section 6.1.1). Thus,
a memory access is defined as an alteration of the MAR.
The average number of memory accesses per cycle for an instruction is calculated as follows:
Avg. # of accesses/cycle = (# of accesses ÷ # of cycles) × avg. fraction of total cycles
The number of accesses is determined from the register transfer description (Section 6.3),
as is the number of cycles, which can alternately be seen in Appendix B.2.4. The percentage of
total cycles is listed under the C/C column in Appendix B.2.2. Table 7.5 condenses this data.
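A small C sketch of this weighted average (the arrays stand in for the per-instruction columns of Table 7.5; names are illustrative):

    /* Average memory accesses per cycle: sum over instruction classes of
       (accesses / cycles) weighted by each class's fraction of total cycles. */
    double avg_accesses_per_cycle(const int *accesses, const int *cycles,
                                  const double *cycle_fraction, int n)
    {
        double avg = 0.0;
        for (int i = 0; i < n; i++)
            avg += ((double)accesses[i] / cycles[i]) * cycle_fraction[i];
        return avg;
    }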
In summary, the Gullwing processor performs an average of 0.667 memory accesses per
cycle while using a single memory bus, compared to 1.421 for DLX/MIPS which uses two.
It would be ideal to reduce Gullwing's memory bandwidth further, but part of the reason
it is already low is because of the extra instructions required to manipulate the stack (which
do not access memory). Reducing this instruction overhead will increase the proportion of
memory-accessing instructions and thus increase the average number of memory accesses per
cycle, towards a maximum of one (Section 9.2.3).
Either way, since loads and stores cannot (and have no need to) overlap instruction fetches
on Gullwing, the maximum number of memory accesses per cycle cannot exceed one. By
comparison, MIPS must have a minimum of one memory access per cycle. When MIPS must
manipulate its call stack, it must perform additional memory accesses in the form of data loads
and stores, increasing its memory bandwidth further.
                                                      Fraction of Total Cycles
Instruction           Accesses  Cycles  Acc./Cyc.     Extensions     VM
PC@                      1        1        1            0.031       0.017
FOLDS                    1        1        1            0.032       0.058
CALL                     2        2        1            0.098       0.076
JMP                      2        2        1            0.033       0.042
JMP0, JMP+               1        1        1            0.046       0.009
JMP0, JMP+ (TAKEN)       2        2        1            0.007       0.001
RET                      2        2        1            0.098       0.115
LIT                      1        1        1            0.097       0.122
@A, @A+, @R+             2        2        1            0.233       0.142
!A, !A+, !R+             2        2        1            0.006       0.070
all others               0        1        0            0.352       0.392
Total of (Acc./Cyc.) × Fraction                         0.681       0.652
Average                                                       0.667
Test               Compiler                Interpreter
CPU                DLX        MIPS         DLX        MIPS
                   GCC92      GCC00        Lisp       Perl
Loads              22.8%      25.1%        31.3%      28.7%
Stores             14.3%      13.2%        16.7%      16.2%
Total              37.1%      38.3%        48.0%      44.9%
Average                 37.7%                   46.5%
Average                            42.1%
Table 7.6: DLX/MIPS Memory Accesses Per Cycle Caused by Loads and Stores
Calculated as the sum of the fraction of total memory accesses done by instructions which alter the Program
Counter (PC): PC@, FOLDS, CALL, RET, all JMPs, and LIT, averaged over both the Ext. and VM tests.
Memory Accesses This is the count of the number of memory fetches, loads, and stores. For
MIPS, this also includes the fetching of instructions. There is no such distinction in Gullwing:
fetching instructions is a special case of data loads.
Cycles This is the count of the number of cycles required. The cycle time is assumed to
be the same for both processors. All MIPS instructions count as one cycle, while the
Gullwing instructions count as one or two cycles. The MIPS pipeline is assumed to be full at
the start and to never stall.
Memory Accesses Per Cycle This is a measure of the memory bandwidth required. A measure of one access per cycle implies a fully-utilized memory bus.
Cycles Per Instruction (CPI) The MIPS pipeline is assumed to never stall and thus to always
have a CPI of one. Since Gullwing instructions take one or two cycles, the CPI depends on the
code being executed.
Instructions Per Memory Word This is a measure of code density. For MIPS, this measure
is always one. For Gullwing, it is variable.
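In other words, the derived measures are three ratios over the raw counts; a minimal C sketch (structure and names are illustrative, not part of the analysis software):

    struct counts {
        double mem_words;     /* static code size in memory words          */
        double instructions;  /* instructions executed (or stored)         */
        double accesses;      /* memory accesses: fetches + loads + stores */
        double cycles;        /* cycles taken                              */
    };

    static double accesses_per_cycle(struct counts c)     { return c.accesses     / c.cycles;       }
    static double cycles_per_instruction(struct counts c) { return c.cycles       / c.instructions; }
    static double instructions_per_word(struct counts c)  { return c.instructions / c.mem_words;    }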
7.3.2 Demonstrators
The C source for the demonstrators was compiled to assembly language, at various optimization levels, with a little-endian MIPS32 cross-compiling version of GCC 3.4.4. The simplest
resulting code was then hand-optimized, if possible, to eliminate the quirks of the compiler
and reproduce optimal hand-assembled code. From this, the equivalent stack-oriented code
was written with an effort towards keeping the algorithm unchanged.
The first demonstrator is the triangular numbers function. It is the additive analogue of the
factorial function: each triangular number is the sum of all the preceding positive integers. It is
quite possibly the tiniest non-trivial example of looping code (since the number of repetitions
is part of the calculations). It is also interesting since it is expressible in several forms. The
iterative, recursive, and tail-recursive forms are analyzed.
9
In the MIPS pipeline, all instructions require one memory access to fetch the instruction, and another access
if the instruction is a load or a store. For Gullwing, there is no division between instruction and data memory.
The fetching or loading of instructions is done by the flow control instructions. Calls, returns, jumps, and taken
conditionals perform two memory accesses. Untaken conditionals and PC@ (PC Fetch) do only one. A PC@ is
implied at the end of a group of instructions that does not end in a flow control instruction. Literal fetches take
one cycle.
10
For Gullwing, conditional jumps use a variable number of cycles. If taken, they behave like a jump, call, or
return, taking two cycles. If not taken, they merely advance to the next instruction, taking one cycle. Fortunately,
in the programs shown, the conditional jump is used to test the loop exit condition or the recursion terminal case
and thus takes one cycle for all algorithm steps except the last one. It is thus considered a one-cycle instruction.
See Section 6.3 for details.
11
There is a closed form of the triangular numbers algorithm which I've not investigated here because it requires
multiplication, which Gullwing does not have: Tri(n) = n(n + 1)/2. Since it's a straightforward expression, it
should evaluate in the same manner on stack-based and register-based computers, with perhaps a small stack
manipulation overhead on stack computers, depending on the exact instruction set.
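For reference, the three forms of the function in C. This is a sketch of the demonstrators' intent rather than the exact source that was compiled (which is not reproduced in the text):

    /* Iterative form. */
    int tri_iter(int n)
    {
        int sum = 0;
        while (n != 0) {
            sum += n;
            n   -= 1;
        }
        return sum;
    }

    /* Plainly recursive form. */
    int tri_rec(int n)
    {
        if (n == 0)
            return 0;
        return n + tri_rec(n - 1);
    }

    /* Tail-recursive form: the running sum is carried in an accumulator. */
    int tri_tail(int n, int sum)
    {
        if (n == 0)
            return sum;
        return tri_tail(n - 1, sum + n);
    }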
Stack/MIPS   Mem Words 1.25    Instructions 2.25    Mem Accesses 1.25    Cycles 2.50
Stack/MIPS   Accesses/Cycle 0.50    Cycles/Instruction 1.11    Instructions/Word 1.80

MIPS            MIPS  Gullwing  Gullwing
move              0       4     >R, R>, DUP, OVER
addu, addiu       2       2     +
load imm          0       1     LIT
beq               1       1     JMP0
b                 1       1     JMP
12
It is fair to say that the MIPS code shown, which has been cleaned up from the actual compiler output, is
spaghetti code. It's not possible to optimize it further without altering the algorithm (recursing to a separate entry
point), and even then it would only save one instruction. The convoluted nature of efficient code created by a
compiler is not usually a human concern, but there's no way I can call it a good thing.
addiu $sp,$sp,-32
sw    $31,28($sp)
sw    $16,24($sp)
move  $16,$4
beq   $4,$0,$L1
move  $2,$0
addiu $4,$4,-1
jal   Tri
addu  $2,$2,$16
lw    $31,28($sp)
lw    $16,24($sp)
j     $31
addiu $sp,$sp,32
Entire Code                        MIPS     Stack   Stack/MIPS
Mem Words                           13        6        0.46
Instructions                        13        8        0.62
Mem Accesses (instr+data)         (13+4)      7        0.41
Cycles                              13       10        0.77
Derived Measures                   MIPS     Stack   Stack/MIPS
Accesses/Cycle                      1.31     0.70       0.53
Cycles/Instruction                  1.00     1.25       1.25
Instructions/Word                   1.00     1.33       1.33
MIPS            MIPS  Gullwing  Gullwing
move              2       2     DUP
addu, addiu       4       2     +
load imm          0       1     LIT
sw                2       0     A!, etc...
lw                2       0     A@, etc...
beq               1       1     JMP0
jal               1       1     CALL
j                 1       1     RET
addiu $sp,$sp,-32
sw    $31,24($sp)
beq   $4,$0,$L1
addu  $5,$5,$4
jal   Tri
addiu $4,$4,-1
lw    $31,24($sp)
j     $31
addiu $sp,$sp,32
Main Body                          MIPS     Stack   Stack/MIPS
Mem Words                            9        6        0.67
Instructions                         9       10        1.11
Mem Accesses (instr+data)          (9+2)      7        0.64
Cycles                               9       12        1.33
Derived Measures                   MIPS     Stack   Stack/MIPS
Accesses/Cycle                      1.22     0.58       0.48
Cycles/Instruction                  1.00     1.20       1.20
Instructions/Word                   1.00     1.67       1.67
MIPS            MIPS  Gullwing  Gullwing
move              0       4     DUP, OVER, >R, R>
addu, addiu       4       2     +
load imm          0       1     LIT
sw                1       0     A!, etc...
lw                1       0     A@, etc...
beq               1       1     JMP0
jal               1       1     CALL
j                 1       1     RET
Entire Code                        MIPS     Stack   Stack/MIPS
Mem Words                            2        1        0.50
Instructions                         2        2        1.00
Mem Accesses (instr+data)          (2+0)      2        1.00
Cycles                               2        3        1.50
Derived Measures                   MIPS     Stack   Stack/MIPS
Accesses/Cycle                      1.00     0.67       0.67
Cycles/Instruction                  1.00     1.50       1.50
Instructions/Word                   1.00     2.00       2.00

MIPS            MIPS  Gullwing  Gullwing
addu              1       1     +
j                 1       1     RET
Add3 The stack code shows a great advantage, as previously seen in the recursive triangular
number example, even for a single nested call with one additional parameter.
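Assuming an add2 and add3 of the obvious form (only add4's C source is reproduced verbatim below, in Algorithm 26), the C versions would be:

    int add2(int a, int b)
    {
        return a + b;
    }

    int add3(int a, int b, int c)
    {
        return add2(add2(a, b), c);
    }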
addiu $sp,$sp,-32
sw    $31,28($sp)
sw    $16,24($sp)
move  $16,$6
jal   add2
move  $4,$2
move  $5,$16
jal   add2
lw    $31,28($sp)
lw    $16,24($sp)
j     $31
addiu $sp,$sp,32
MIPS            MIPS  Gullwing  Gullwing
move              3       0     DUP, etc...
addu, addiu       2       0     +
sw                2       0     A!, etc...
lw                2       0     A@, etc...
jal               2       2     CALL
j                 1       1     RET
Entire Code                        MIPS     Stack   Stack/MIPS
Mem Words                           12        5        0.42
Instructions                        12        3        0.25
Mem Accesses (instr+data)         (12+4)      6        0.38
Cycles                              12        6        0.50
Derived Measures                   MIPS     Stack   Stack/MIPS
Accesses/Cycle                      1.33     1.00       0.75
Cycles/Instruction                  1.00     2.00       2.00
Instructions/Word                   1.00     0.60       0.60
Add4 This more complex example requires some stack manipulation. Nonetheless, its performance is still far better than the corresponding MIPS code.
Algorithm 26 Add4 C Source
int add4 (int a, int b, int c, int d){
return add2(add2(a,b),add2(c,d));
}
Algorithm 27 Add4 MIPS32 Assembly
add4: addiu $sp,$sp,-40
      sw    $31,36($sp)
      sw    $18,32($sp)
      sw    $17,28($sp)
      sw    $16,24($sp)
      move  $16,$6
      move  $17,$7
      jal   add2
      move  $18,$2
      move  $4,$16
      move  $5,$17
      jal   add2
      move  $4,$18
      move  $5,$2
      jal   add2
      lw    $31,36($sp)
      lw    $18,32($sp)
      lw    $17,28($sp)
      lw    $16,24($sp)
      j     $31
      addiu $sp,$sp,40
MIPS            MIPS  Gullwing  Gullwing
move              7       2     >R, R>
addiu             2       0     +
sw                4       0     A!, etc...
lw                4       0     A@, etc...
jal               3       3     CALL
j                 1       1     RET
Entire Code                        MIPS     Stack   Stack/MIPS
Mem Words                           21        7        0.33
Instructions                        21        6        0.29
Mem Accesses (instr+data)         (21+8)      8        0.28
Cycles                              21       10        0.48
Derived Measures                   MIPS     Stack   Stack/MIPS
Accesses/Cycle                      1.38     0.80       0.56
Cycles/Instruction                  1.00     1.67       1.67
Instructions/Word                   1.00     0.86       0.86
7.4 Pipelining
The unpipelined view of the Gullwing processor presented in Chapter 6 shows a simple computer with a number of components comparable to the DLX processor of Hennessy & Patterson. However, the cycle time of Gullwing as shown must be greater than that of the DLX
simply because an instruction cycle includes the full path from the Instruction Shift Register
(ISR), through the (implicit) decoding logic, to the controlled units such as the ALU.
Pipelining Gullwing would overlap the decoding and executing of an instruction and reduce
the cycle time to that of the slowest stage, which is usually the adder in the ALU. I'll show this
change in the same manner as Koopman [Koo90], as a transformation of the well-known DLX
pipeline, but in much more detail. The resulting Gullwing pipeline has a structure that implies
a comparable cycle time to the DLX pipeline.
[Figure 7.1: The DLX pipeline: stages IF, ID, EX, MEM, and WB, containing the Instruction Memory (IM), Register file (Reg), ALU, Data Memory (DM), and Register file (Reg) respectively.]
For reference, Figure 7.1 [HP02, Fig. A-3] shows the basic shape of the DLX pipeline. Each
stage contains a particular subsystem: Instruction Memory (IM), Register File (Reg), Arithmetic and Logic Unit (ALU), and Data Memory (DM). These stages are commonly referred
to as the Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), and
Write-Back (WB) stages. Stages are separated by pipeline registers. The Register file is simultaneously read in the ID stage and written in the WB stage.
The pipelining of the Gullwing processor can be explained by exchanging the subsystems
in the DLX pipelining with those of Gullwing and following the implications.
The first change is that Gullwing has a single memory bus for instructions and data. This
removes the distinction between IM and DM. Since the IF stage fetches an instruction every
cycle, this implies a structural hazard for every data access. However, the zero-operand instruction format of Gullwing means that several instructions can be packed as a group into a single
memory word (Section 6.2.7). This reduces the occurrence of the structural hazard to between
2 and 4 percent of executed instructions (PC@ count in Appendix B.2.2). Furthermore, the end
of a group of instructions is explicitly marked by the inclusion of a PC@ (PC Fetch) instruction
which fetches the next group of instructions in parallel with the current instruction if possible.
Combined, these two features accomplish the function of the IF stage and effectively divide
it between MEM, where the instructions are fetched, and ID, where they are held and decoded.
The second change is the replacement of the register file with a stack, which has the disadvantage of forcing a RAW (Read-After-Write) dependency between all instructions. This
means that an instruction in ID must wait for the previous instruction in EX to reach WB before
being able to continue. However, a stack has the advantage of having its inputs and outputs
immediately available, without having to decode addresses13 . This effectively makes them
into registers equivalent to the ID/EX and EX/MEM pipeline registers connected to the ALU.
Thus the stack can be moved out of ID and placed into EX without any speed penalty. This
eliminates the WB stage and simplifies ID.
The third change is the use of direct addressing. The DLX processor uses displacement
addressing to simulate a number of other addressing modes. This requires the ALU to compute
an address for each load or store instruction, forcing EX to precede MEM. Direct addressing
removes this dependency and so both stages can now operate in parallel. Since MEM already
contains the incrementer for the Program Counter (brought over from IF in the first transformation), it can be re-used to implement post-incrementing direct addressing.
The end result of these changes is the pipeline shown in Figure 7.2a, where ID
decodes the instructions from the ISR, and EX and MEM execute the instructions. Figure 7.2b
shows how these stages map to the existing Gullwing functional blocks. Note that the EX and
MEM stages both contain adding circuitry and so place a lower limit on the cycle time that is
comparable to that of the DLX EX stage.
The operation of the pipeline is similar to that of the DLX (Figure 7.3). Since the pipeline
introduces one stage of latency to execution, the next group of instructions is loaded into the
ISR while the last instruction in the current group (5) decodes. This process is detailed in the
next section. Instructions that require two cycles to execute, such as loads, occupy the ID
stage for the duration of their execution (Figure 7.4). Loads and stores must take two cycles
since they currently cannot be overlapped, but on the other hand there is no load delay slot.
Overlapping loads and stores are discussed in Section 9.2.3.
[Figure 7.2: (a) the two-stage Gullwing pipeline, with ID feeding the parallel EX and MEM stages; (b) the mapping of these stages onto the existing Gullwing functional blocks (ISR, ALU, TOP, DS, RS, A, PC, MAR, and the +1 incrementer).]
Since a stack is only ever accessed sequentially, the addressing of the individual stack registers reduces to a
single-bit shift register, one bit per stack register, with no decoding required. A more aggressive design would
further reduce the entire stack to a word-wide shift register.
[Figure 7.4: pipelined execution of a two-cycle load (@A): the @A instruction stays in ID for both of its cycles while the access completes in MEM, delaying the following instructions.]
inserted PC@ will be in S1 and enable the L signal, signalling the last instruction. If the instruction in S0 does not access memory, then both it and the PC@ execute in parallel and S0
and S1 are then loaded from S214 , which contains the actual last instruction (Figure 7.6). If
the instruction in S0 accesses memory, then the instruction fetch is not overlapped, and the
instructions are shifted as usual (Figure 7.7).
If not all the instruction slots are filled, then the compiler must make sure that a PC@ is
compiled before the last instruction and fill the remaining slots with NOPs. It must also make
sure the built-in PC@ is never executed after a previous PC@ in the same word. For example,
if the last instruction would end up in S4, then it must be moved to S5 and a NOP placed in S4.
Such instruction reordering usually leads to a one-cycle penalty in execution.
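A C sketch of the basic slot-filling rule (a hypothetical helper, not the actual Flight compiler; the opcode values are those of the VM's opcode table in Appendix A.3.1, and the S4/S5 special case above is not modelled):

    #include <stdint.h>

    #define SLOTS_PER_WORD 6
    #define OPCODE_WIDTH   5
    #define OP_PCFETCH     0     /* PC@ */
    #define OP_NOP         30    /* NOP */

    /* Pack a short group of n opcodes (1 <= n <= 5) into one memory word,
       slot 0 in the least significant bits.  An explicit PC@ is placed just
       before the last opcode so the next group can be fetched while it
       executes, and the remaining slots are padded with NOPs. */
    uint32_t pack_short_group(const uint8_t *ops, int n)
    {
        uint32_t word = 0;
        int slot = 0;
        for (int i = 0; i < n - 1; i++)                         /* all but the last opcode */
            word |= (uint32_t)(ops[i] & 0x1F) << (OPCODE_WIDTH * slot++);
        word |= (uint32_t)OP_PCFETCH << (OPCODE_WIDTH * slot++);          /* explicit PC@ */
        word |= (uint32_t)(ops[n - 1] & 0x1F) << (OPCODE_WIDTH * slot++); /* last opcode  */
        while (slot < SLOTS_PER_WORD)                            /* pad the rest with NOPs */
            word |= (uint32_t)OP_NOP << (OPCODE_WIDTH * slot++);
        return word;
    }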
[Figure: (a) ISR for Pipeline. A 32-bit memory word holds six 5-bit instruction slots S0 to S5 (S0 in the least significant bits, two bits unused); the shift register appends a built-in PC@ and NOP, and the L signal flags the last instruction slot.]
[Figure 7.6: pipeline timing of an overlapped instruction fetch: when the L signal is set, the PC@ executes in MEM in parallel with the last instruction of the group executing in EX.]
14
S1 is overwritten by S2 so as to prevent two PC@ from being executed in sequence. The instruction in S2,
being the actual last instruction, can never be a PC@, else the situation described in the first paragraph occurs.
[Figure 7.7: pipeline timing when the last instruction of a group itself uses MEM: the instruction fetch cannot be overlapped and is executed sequentially, with the following instruction loaded from S1.]
[Figure: pipeline timing of a taken conditional jump (JMP0). Notes from the original figure: Data Hazard Slot; JMP0 stays in ID; Branch Delay Slot; Branch Target.]
Chapter 8
Improving Code Density
The density of Gullwing low-level code is very good. A 32-bit word holds six instruction slots.
However, instructions that require an in-line argument, such as calls, jumps, and literal fetches,
additionally use up one whole subsequent memory word. Including one such instruction
in a group, while keeping all slots filled, raises the memory usage to two words and thus halves
the code density to three instructions per word. Adding another such instruction drops the density to two, and so on
until all six instructions require an in-line argument, with a resulting minimum code density of
six-sevenths (0.86). As the number of slots in a memory word increases, the minimum code
density increases towards unity.
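For a single full group this works out to a density of 6/(1 + k) instructions per memory word, where k of the six instructions carry an in-line argument: k = 0 gives 6, k = 1 gives 3, k = 2 gives 2, and k = 6 gives 6/7 ≈ 0.86.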
This situation is unfortunately a narrow best case. It is only applicable when all instruction
slots can be filled. This is true for literal fetches since they do not alter the program flow, and
for conditional jumps since they simply continue with the next instruction if the condition is
not met. Calls and unconditional jumps always load a new instruction group. Memory is
word-addressed and groups always begin execution with the first slot, so jumping or
returning to the instruction slots after a call or jump is impossible; execution must instead resume
at the next memory word after the argument. Sequences of calls thus end up wasting most of
the available slots (Figure 8.1), bringing the minimum code density down to one-half.
Sequences of calls are typical of high-level code, where a procedure is primarily composed
of calls to other procedures. Sequences of jumps are rare, never executed sequentially, and are
not considered further. The actual usage of the instruction slots is listed in Appendix B.1.5.
The low ratio of filled instruction slots suggests that there is room for significant improvement
in code density.
[Figure 8.1: a sequence of three calls. Each CALL occupies the first slot of a word, the remaining slots are unreachable, and the address of each subroutine occupies the following memory word.]
Returns take their argument from the Return Stack and so require no extra memory.
As do returns.
[Figures: panels (a) and (b) showing the remaining contents of the ISR being pushed onto and popped from the Instruction Stack (IS) by CALL and RET, alongside the Return Stack (RS); with the IS, a sequence of calls can fill all the available slots of a word (CALL CALL CALL), with the subroutine addresses in the following memory words; further panels show >R and R> interacting with TOP, PC@, the IS, and the RS.]
8.2 Implementation
The implementation of the code density optimization is straightforward, consisting mainly of
adding control lines to push and pop the IS as required. Algorithm 29 shows the changes required. Added controls are in bold, while removed controls are struck through. The instruction
fetch overlap optimizations (Section 6.3.1) are not included as they are orthogonal.
Outputs
--------
N Control
- -------
1 PC(+1)→R, R→RS, RS(PUSH), MEM→MAR,PC, ISR→IS, IS(PUSH)
0 PC(+1)→PC,MAR, MEM→ISR
0 RS(POP), RS→R, R→PC,MAR, IS(POP), IS→ISR
0 PC(+1)→PC,MAR, MEM→ISR
0 DS(POP), DS→TOP, TOP→R, R→RS, RS(PUSH), PC@→IS, IS(PUSH), ISR<<
0 RS(POP), RS→R, R→TOP, TOP→DS, DS(PUSH), IS(POP), ISR<<
Silicon Area Adding a third stack increases the silicon area required by the processor. The
additional area consumed by the IS is similar to that of the Return Stack since it must be of the
same depth and slightly less wide.
The breakeven point between the additional area of the IS and the saved area in main
memory, given a uniform word width, is reached when the reduction in the original size S of a
program, due to the addition of the IS mechanism (a fraction R of S), is equal to the depth D of the IS stack:
S − S(1 − R) = D.
For example, if R is taken to be the previously determined value of 31.5%, then S needs to
be 3.17 times larger than D for its size to be reduced by the same amount as the size of the IS.
Since a stack rarely needs to be more than 32 words deep (Section 5.5), S would only need to
be equal to 102 memory words to justify the additional area of the IS mechanism.
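Equivalently, the breakeven program size is S = D/R: with D = 32 and R = 0.315, S = 32/0.315 ≈ 102 memory words, and S/D = 1/R ≈ 3.17, as stated above.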
For larger programs, the reduction in code size would greatly outweigh the area of the IS.
The size of the main memory can thus be correspondingly reduced. The lowered total silicon
area (processor and memory) would especially benefit embedded systems.
Subroutine Overhead The use of the IS eliminates the need to fetch the remaining instruction slots after a call instruction. Algorithm 29 shows that the load in the eliminated second
cycle of the return instruction is no longer required because the remaining instruction slots are
now loaded from the IS during the first cycle. This reduces the overhead of calling and returning from a subroutine to three cycles, down from four, and also reduces the associated memory
traffic by the same 25%.
This assumes that the stacks and main memory are implemented using the same memory technology.
Chapter 9
Conclusions, Contributions, and Further
Work
The first part of this thesis presented the historical origins of the first generation of stack computers and found that these machines were derived from Bauer's stack principle for the allocation of storage in nested subroutines, later used in the specification of ALGOL and now seen as
the call stack of C and other programming languages, and that the second generation of stack
computers was based on Hamblin's independently discovered stack principle, geared at function composition instead of storage allocation. The English Electric KDF9, made commercially
available around 1963, was found to stand out as the first second-generation stack computer
and the only one until the NOVIX NC4016 in 1985. This gap, and the coincidence with the
appearance of RISC microprocessors, accounts for the obscurity of the second generation.
The second part of this thesis built upon the first by proposing a set of criteria to distinguish
first and second-generation stack computers. In summary, second generation stack computers
keep their stacks in the CPU instead of main memory, use the stacks for the evaluation and
composition of functions instead of procedure storage allocation, and use simple, RISC-like
instructions instead of complex microcoded operations. Chapter 5 then presented a rebuttal to
the influential arguments against stack architectures cited by Hennessy & Patterson and found
that they are not applicable to second-generation stack computers due to their different design
and to advances in hardware technology and compilers. The most telling finding is that some
modern processors, such as the Intel Pentium IV and the Digital Alpha AXP 21064, use a 16-deep internal stack to cache return addresses in the same manner as second-generation stack
computers.
The third part of this thesis specified the design of a small second-generation stack machine, named Gullwing. The first unusual feature found was that the packing of multiple
instructions per memory word, made possible by the zero-operand instruction format, reduced
the number of sequential instruction fetches to between 8.3% and 9.7% of the total number
of executed instructions despite achieving an average code density of only 1.2 instructions per
memory word. An additional simple optimization of the instruction fetch mechanism reduced
this fetching overhead by 50.6 to 77.3%, down to 4.1 to 2.2% of the total number of executed
instructions, effectively eliminating the need for a second memory bus dedicated to fetching
instructions.
Chapter 7 then compared Gullwing to the DLX and MIPS processors, via some aggregate
benchmarks and some low-level code comparisons. In Section 7.2, it was observed that 23.3
to 29.2% of the executed instructions on Gullwing are stack manipulation instructions whose
actions are implicit in the three-operand instruction format of MIPS. Similarly, Gullwing must
perform 10 to 16% more immediate loads since there are no operands to hold small constants.
The average CPI of Gullwing was between 1.301 and 1.312: a penalty of 13.1 to 18.2% over
DLX. However, Gullwing is architecturally equivalent to a DLX processor without delayed
branches and loads. Without these optimizations the CPI of DLX would have been 1.34 to
1.41 instead: 2.1 to 8.4% worse than Gullwing.
It was also found that MIPS performed an average of 1.421 memory accesses per cycle,
while Gullwing required only 0.667: a 53% reduction in memory bandwidth. For instruction
fetching alone, Gullwing required 55.9% fewer memory accesses per cycle on average (0.441
vs. 1.00 for MIPS). These improvements were found to originate in the packing of multiple
instructions per memory word: 71.1 to 80.9% of code basic blocks fit within a single memory
word on Gullwing.
Section 7.3 showed that Gullwing is at a disadvantage when multiple intermediate results
are required. A comparison of iterative code demonstrated that Gullwing required 25% more
memory space, 125% more instructions, 25% more memory accesses, and 150% more cycles,
compared to MIPS, to execute the iterative algorithm due to the need to shuffle values on and
off the Data Stack.
On the other hand, Gullwing exhibits extremely efficient function calls. The same algorithm, implemented recursively on Gullwing, required 54% less memory space, 38% fewer
instructions, 59% fewer memory accesses, and 23% fewer cycles than the equivalent recursive MIPS code due to the elimination of the instructions required to simulate a stack in main
memory.
When looking at pure nested subroutines, as would be the case with precompiled libraries,
Gullwing's advantage at subroutine calls is amplified further: 58 to 67% less (code) memory
space, 71 to 75% fewer instructions, 62 to 72% fewer memory accesses, 50 to 52% fewer
cycles, and a 25 to 44% reduction in memory bandwidth, despite an average CPI between 1.67
and 2.00 and a code density between 0.60 and 0.86 instructions per word.
As originally specified, the Gullwing processor was not pipelined, and so its cycle time
would have exceeded that of the pipelined DLX processor. A pipelined form of Gullwing was
specified by transforming the stages of the DLX pipeline to Gullwings stack architecture. The
result is a two-stage pipeline with parallel EX and MEM stages, which has branch delay slots like
DLX, but no load delay slots and additional branch hazard slots due to the use of a stack instead
of registers. In the worst case where these slots could not be used productively, the average CPI
of Gullwing would increase by 12.9 to 17.8%, to a range of 1.481 to 1.533. Assuming that the
ALU is the critical path of a simple pipeline, then the two-stage pipelined form of Gullwing
should have a similar cycle time to the five-stage DLX pipeline.
Finally, Chapter 8 proposed the use of a third stack to temporarily hold instructions during
subroutine calls. This Instruction Stack would maximize the density of high-level code with
many branches and calls, reducing the overall code size by a little over 30% (up to a theoretical
limit of 50%), and would reduce the memory traffic and cycle count overhead of calling and
returning from a subroutine by 25%.
9.1 Contributions
This thesis makes the following contributions:
1. a historical review of first-generation stack computers which uncovers the origins of the
conceptual difference between first and second-generation machines (Chapter 2);
2. a historical review of second-generation stack computers which provides a summary of
many lesser-known and unpublished machines (Chapter 3);
3. a set of criteria to distinguish first and second-generation stack computers which expand
on those given by Feldman and Retter [FR93, pp.599-604] (Chapter 4);
4. a rebuttal of the arguments against stack computers cited by Hennessy and Patterson,
showing that they are applicable to the first generation of stack computers, but not the
second (Chapter 5);
5. a register-transfer level description of Gullwing, a simple, modern second-generation
stack computer, along with an optimization of its instruction fetching mechanism (Chapter 6);
6. an initial comparison of the execution statistics of non-numerical code on both Gullwing
and MIPS (Section 7.2);
7. a detailed comparison of the behaviour of iteration, recursion, tail-recursion, and subroutine calls on both Gullwing and MIPS (Section 7.3);
8. the design of the Gullwing pipeline as a transformation of the DLX pipeline (Section
7.4);
9. the proposal of an instruction stack to maximize the code density and accelerate the
subroutine returns of Gullwing (Chapter 8).
[Figure: registers $0, $1, $2, ... , $30 treated as the Data Stack (DS).]

Stack    MIPS
CALL     jal $Ln
RET      j $31
JMP      b $Ln
JMP0     beq $1,$0,$Ln
LIT      addi $1,$0,$Ln
@        lw $1,($30)
!        sw $1,($30)

Stack    MIPS
NOT      xor $1,$1,-1
AND      and $1,$1,$2
+        add $1,$1,$2
DROP     add $0,$0,$1
SWAP     add $1,$2,$0
PUSH     add $1,$30,$0
POP      add $30,$1,$0
addu  $30,$1,$0   // POP
beq   $30,$0,$L1  // JMP0
addu  $1,$30,$0   // PUSH
addiu $1,$0,-1    // LIT
jal   Tri         // CALL
addu  $1,$1,$2    // +
addu  $1,$1,$2    // +
j     $31         // RET
Instructions    Without   With
move               2        0
addu, addiu        4        5
sw                 2        0
lw                 2        0
beq                1        1
jal                1        1
j                  1        1
Table 9.2: Recursive MIPS32 Instruction Distribution With and Without Stacks
Entire Code                      Without   With   With/Without
Mem Words                           13       8        0.62
Instructions                        13       8        0.62
Mem Accesses (instr+data)         (13+4)     8        0.62
Cycles                              13       8        0.62
Derived Measures                 Without   With   With/Without
Accesses/Cycle                      1.31    1.00       0.76
Cycles/Instruction                  1.00    1.00       1.00
Instructions/Word                   1.00    1.00       1.00
Table 9.3: Triangular Recursive MIPS32 Code Comparison With and Without Stacks
Outputs
--------
N Control
- -------
0 TOP,DS→ALU(+)→TOP, DS(POP), ISR<<
0 TOP,DS→ALU(+)→TOP, TOP→DS, DS(PUSH), ISR<<
1 DS→TOP, DS(POP), MEM→PC,MAR
0 DS→TOP, DS(POP), PC(+1)→PC,MAR, ISR<<
0 PC@
PC@
Outputs
--------
N Control
- -------
0 DS→TOP, DS(POP), A→MAR,(+1)→A, ISR<<
0 MEM→TOP, TOP→DS, DS(PUSH), R→MAR,(+1)→R, ISR<<
0 MEM→TOP, TOP→DS, DS(PUSH), PC→MAR, ISR<<
0 TOP,DS→ALU(XOR)→TOP, DS(POP), RS(POP), RS→R, R→PC,MAR, ISR<<
0 0 PC(+1)→PC,MAR, MEM→ISR
By comparison, MIPS R-type instructions use a total of twelve opcode bits, but do not encode 2^12 unique
instructions. The encoding must thus be sparse and simple to decode.
2
A simple hypothetical comparison routine based on COMPARE_STRING from Appendix A.1.3.
Appendix A
Gullwing Benchmarks Source
The appendix provides the source code for the benchmarks described in Section 7.1.
Contains the address of the next memory word where code is to be compiled.
SLOT Contains the bitmask which defines the current available instruction slot in the memory word pointed-to by HERE.
THERE Contains the address of the top of the function name dictionary. Also the pointer to
the bottom of the input buffer.
NAME_END Contains the address of the end of the function name dictionary. It is used to
detect the failure of a dictionary search.
INPUT Contains the address of the top of the input buffer, which is the beginning of the most
recently received string.
Start of Memory
Function Code
HERE
HERE_NEXT
Free Memory
INPUT
Input Buffer
THERE
Name Dictionary
NAME_END
Figure A.1: Flight Language Kernel Memory Map
Count
Body
Tail
SCAN_STRING
Setup at fifth instruction slot, so the next compilation will occur at the zeroth
// update HERE
LIT [address of HERE] >A A!
JMP NULL_SLOT
NEXT_SLOT Point to first slot if HERE is empty (NULL_SLOT), else point to next free
slot at HERE, else ALIGN.
// get slot mask
LIT [address of SLOT] >A A@
LIT [null slot mask] OVER XOR JMP0 HEREEMPTY
LIT [fifth slot mask] OVER XOR JMP0 HEREFULL
// next 5-bit slot
2* 2* 2* 2* 2* A! RET
HEREFULL:
DROP CALL ALIGN JMP FIRST_SLOT
HEREEMPTY:
DROP JMP FIRST_SLOT
DEFN_AS Takes the ALIGNed address of a code entry (usually [HERE]) and the address of
a SCANed string (from INPUT) and converts it to a name entry. Updates THERE.
The name string must be the only one in the input buffer stack as it is converted in place,
so INPUT must be pointing to it and its tail must be at THERE. Because of this, INPUT does
not need to be changed.
// store address in [THERE]
LIT [address of THERE] >A A@ >A A!
// get string start address
LIT [address of INPUT] >A A@
// move to first free location before it
LIT -1 +
LIT [address of THERE] >A A! RET
NEW_WORD Returns an aligned address at which to compile code.
CALL ALIGN
// get [HERE]
LIT [address of HERE] >A A@ RET
DEFN
CALL NEW_WORD
JMP DEFN_AS
COMPILE:
LIT [address of HERE] >A A@
// compile opcode, drop mask, return
>A A@ + A! DROP RET
COMPILE_LITERAL Takes a procedure address or literal (for CALL, JMP, JMP0, or LIT)
and compiles it at HERE_NEXT. Increments HERE_NEXT.
LIT [address of HERE_NEXT] >A A@gate
// store address/literal and increment
>A A!+
// update HERE_NEXT
A> LIT [address of HERE_NEXT] >A A! RET
CMPC, (CALL) Takes the address of a procedure and compiles a call to it. Then aligns to the
next free word since later slots in current word will never execute.
LIT [opcode for CALL]
CALL COMPILE_OPCODE
CALL COMPILE_LITERAL
JMP ALIGN
CMPJ, (JMP)
LIT [opcode for JMP]
CALL COMPILE_OPCODE
CALL COMPILE_LITERAL
JMP ALIGN
CMP0, (JMP0) If JMP0 is not taken, the next opcode in the same memory word runs instead,
unlike CALL and JMP.
LIT [opcode for JMP0]
CALL COMPILE_OPCODE
JMP COMPILE_LITERAL
CMP+, (JMP+)
LIT [opcode for JMP+]
CALL COMPILE_OPCODE
JMP COMPILE_LITERAL
NUMC, (LIT)
NXEC This is the user-interface loop. Reads in a string, looks up the function address, and
calls to it. In effect, the kernel does nothing but execute all commands given to it.
CALL SCAN
CALL LOOK
CALL POP_STRING
CALL EXECUTE
JMP NXEC
EXEC Reads in the address of a function and calls to it. It is an alternate main loop meant
to be used by communicating instances of the kernel since it is more efficient to pass function
addresses than names.
CALL READ1
CALL EXECUTE
JMP EXEC
Moving data, input, and output is denoted by angle brackets < and >.
The Flight language evolves very quickly at the beginning. The first few functions are used
throughout the latter code and must be understood before proceeding further.
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
alias
CMPJMPZERO
(JMP0)
CMPJMPPLUS
(JMP+)
CMPCALL
(CALL)
CMPRET
;
NUMC
#
NUMI
$n
LOOK
$l
POP_STRING
$pop
DEFN
$:
SCAN
>$
WRITE1
#>
READ1
>#
TENSTAR
10*
CMPLOADA
(A@)
CMPSTOREA
(A!)
CMPLOADAPLUS (A@+)
CMPSTOREAPLUS (A!+)
CMPLOADRPLUS (R@+)
CMPSTORERPLUS (R!+)
CMPXOR
(XOR)
CMPAND
(AND)
CMPNOT
(NOT)
CMPTWOSTAR
(2*)
CMPTWOSLASH
(2/)
CMPPLUS
(+)
CMPPLUSSTAR
(+*)
CMPDUP
(DUP)
CMPDROP
(DROP)
CMPOVER
(OVER)
CMPTOR
(>R)
CMPRFROM
(R>)
CMPTOA
(>A)
CMPAFROM
(A>)
CMPNOP
(NOP)
@
!
XOR
AND
NOT
OR
2*
2/
+
+*
DUP
DROP
OVER
NOP
(>A) (A@) ;
(>A) (A!) ;
(XOR) ;
(AND) ;
(NOT) ;
(OVER) (NOT) (AND) (XOR) ;
(2*) ;
(2/) ;
(+) ;
(+*) ;
(DUP) ;
(DROP) ;
(OVER) ;
(NOP) ;
\a
\b
\t
\n
\v
\f
\r
\s
n#
n#
n#
n#
n#
n#
n#
n#
7 n# 1 c #> j #>
8 n# 1 c #> j #>
9 n# 1 c #> j #>
10 n# 1 c #> j #>
11 n# 1 c #> j #>
12 n# 1 c #> j #>
13 n# 1 c #> j #>
32 n# 1 c #> j #>
: forget
c >$
l# THERE (@) (N+) 1 (DUP)
: forget-loop
l# INPUT (@)
c COMPARE_STRINGS
c match?
if (DROP) c STRING_TAIL (N+) 1 (DUP) c end?
if
(DUP) j forget-loop
else
(DROP) c POP_STRING ;
else
c erase (DROP) ;
(+*)
(+*)
(+*)
(+*)
(2/)
(2/)
(2/)
(2/)
(+*) (2/)
(+*) (2/)
(+*) (2/)
(+*)
: U/
c divby0check
n# 0 (>A)
: U/-loop
(OVER) (>R) (DUP) (R>) c <=
if
(DUP) (>R) (-) (R>)
(A>) (N+) 1 (>A)
j U/-loop
else
(DROP) (A>) ;
n 3 caesargen encode
-n 3 caesargen decode
// Takes the address of a string
// and the name of a function.
// Maps the function to each
// element of the string.
: map1
(DUP) (>R) c STRING_TAIL (R>)
(N+) 1
c l (>R)
: map1-loop
(DUP) (R>) (DUP) (>R) c EXECUTE (N+) 1
(OVER) (OVER) (XOR)
if j map1-loop
else (DROP) (DROP) (R>) (DROP) ;
// Input: ABCD
// Output: ABCD DEFG ABCD
>$ ABCD
// Print the string
l INPUT @ DUP cs> \t
DUP map1 encode
// Print the ciphered version
DUP cs> \t
// Print the deciphered version
DUP map1 decode
$> \t \n
// Since the argument is a constant
// it can be compiled as a literal fetch instead
: caesargen
c :
c #
l# caesar c (JMP) ;
A.3.1 VM
// Define an 8kB memory for the VM
: MEMSIZE n# 8192 ;
: MEM MEMSIZE var ;
: OPCODEWIDTH n# 5 ;
: OPCODEMASK n# 31 ;
: MEM_INPUT MEM MEMSIZE + -n 2 + # ;
: MEM_OUTPUT MEM MEMSIZE + -n 1 + # ;
: MEMHEAD MEM # ;
: MEMTAIL MEM_OUTPUT # ;
: (check_low)
MEMHEAD negate # c # c (+) ;
: (check_high)
c (negate) MEMTAIL # c # c (+) ;
: mem_in_range?
(DUP) (check_low) (OVER) (check_high) (OR) ;
: mem_access_msg
create >$ ILLEGAL_MEMORY_ACCESS: $>c
does
j cs>
: report_mem_error
c mem_access_msg
c \s c #$> j \n
: access_check
c mem_in_range?
ifc report_mem_error
j errcontext
else ;
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
PCFETCHopcode
n# 0 ;
CALLopcode
n# 1 ;
RETopcode
n# 2 ;
JMPopcode
n# 3 ;
JMPZEROopcode
n# 4 ;
JMPPLUSopcode
n# 5 ;
LOADAopcode
n# 6 ;
STOREAopcode
n# 7 ;
LOADAPLUSopcode n# 8 ;
STOREAPLUSopcode n# 9 ;
LOADRPLUSopcode n# 10 ;
STORERPLUSopcode n# 11 ;
LITopcode
n# 12 ;
UND0opcode
n# 13 ;
UND1opcode
n# 14 ;
UND2opcode
n# 15 ;
XORopcode
n# 16 ;
ANDopcode
n# 17 ;
NOTopcode
n# 18 ;
TWOSTARopcode
n# 19 ;
TWOSLASHopcode
n# 20 ;
PLUSopcode
n# 21 ;
PLUSSTARopcode
n# 22 ;
DUPopcode
n# 23 ;
DROPopcode
n# 24 ;
OVERopcode
n# 25 ;
TORopcode
n# 26 ;
RFROMopcode
n# 27 ;
TOAopcode
n# 28 ;
AFROMopcode
n# 29 ;
NOPopcode
n# 30 ;
UND3opcode
n# 31 ;
: do_call
// Move run_vm return address
(R>)
PCREG # (@)
(DUP) (N+) 1 (>R)
(@) c access_check (DUP)
(N+) 1 PCREG # (!)
(@) ISRREG # (!)
// Restore run_vm return address
(>R) ;
: do_ret
(R>)
(R>) c access_check (DUP)
(N+) 1 PCREG # (!)
(@) ISRREG # (!)
(>R) ;
: do_jmp
PCREG # (@) (@)
c access_check (DUP)
(N+) 1 PCREG # (!)
(@) ISRREG # (!) ;
: do_jmpzero
if
PCREG # (@) (N+) 1 (A!) ;
else
j do_jmp
: do_jmpplus
ifPCREG # (@) (N+) 1 (A!) ;
else
j do_jmp
: do_loada
AREG # (@)
(DUP) MEM_INPUT # (XOR)
if
c access_check (@) ;
else
(DROP) j >#
: do_storea
AREG # (@)
(DUP) MEM_OUTPUT # (XOR)
if
c access_check (!) ;
else
(DROP) j #>
: do_loadaplus
AREG # (@)
(DUP) MEM_INPUT # (XOR)
if
c access_check
(>A) (A@+) (A>)
AREG # (!) ;
else
(N+) 1 AREG # (!) j >#
: do_storeaplus
AREG # (@)
(DUP) MEM_OUTPUT # (XOR)
if
c access_check
(>A) (A!+) (A>)
AREG # (!) ;
else
(N+) 1 AREG # (!) j #>
: do_loadrplus
(R>) (>A)
(R>)
(DUP) MEM_INPUT # (XOR)
if
c access_check
(>R) (R@+)
(A>) (>R) ;
else
(N+) 1 (>R)
(A>) (>R)
j >#
: do_storerplus
(R>) (>A)
(R>)
(DUP) MEM_OUTPUT # (XOR)
if
c access_check
(>R) (R!+)
(A>) (>R) ;
else
(N+) 1 (>R)
(A>) (>R)
j #>
: do_lit
PCREG # (@)
(DUP) (N+) 1 (A!)
(@) ;
:
:
:
:
:
:
:
:
:
:
:
do_und
n
do_xor
do_and
do_not
do_twostar
do_twoslash
do_plus
do_plusstar
do_dup
do_drop
do_over
31 COMPILE_OPCODE ;
(XOR) ;
(AND) ;
(NOT) ;
(2*) ;
(2/) ;
(+) ;
(+*) ;
(DUP) ;
(DROP) ;
(OVER) ;
: do_tor
(R>) (>A)
(>R)
(A>) (>R) ;
: do_rfrom
(R>) (>A)
(R>)
(A>) (>R) ;
: do_toa
AREG # (!) ;
: do_afrom AREG # (@) ;
: do_nop
(NOP) ;
: ,
c #>c c ALIGN ;
: &>
c l c , ;
// Indexed by the opcode
: instruction_call_table
create
&> do_pcfetch
&> do_call
&> do_ret
&> do_jmp
&> do_jmpzero
&> do_jmpplus
&> do_loada
&> do_storea
&> do_loadaplus
&> do_storeaplus
&> do_loadrplus
&> do_storerplus
&> do_lit
&> do_und
&> do_und
&> do_und
&> do_xor
&> do_and
&> do_not
&> do_twostar
&> do_twoslash
&> do_plus
&> do_plusstar
&> do_dup
&> do_drop
&> do_over
&> do_tor
&> do_rfrom
&> do_toa
&> do_afrom
&> do_nop
&> do_und
does ;
: (shift_isr)
c (2/) c (2/) c (2/) c (2/) c (2/) ;
: (extract_instruction)
OPCODEMASK # c # c (AND) ;
: do_next_instruction
ISRREG # (@)
(DUP) (shift_isr) (A!)
(extract_instruction)
instruction_call_table # (+) (@)
(>R) ;
// All input now gets processed by the software
// inside the VM instead of the original Flight language kernel
: run_vm
MEM # PCREG # (!)
n 0 # ISRREG # (!)
n 123456 # AREG # (!)
: vm_loop
c do_next_instruction
j vm_loop
// Output short message to show
// that compilation reached this point
>$ VM4-1 $> \n
A.3.2 Metacompiler
The metacompiler saves and restores the internal state of the language kernel. This allows
redirecting the operation of the kernel to a different memory area. In this case, it is used to
direct compilation and execution of code to the previously defined Virtual Machine memory
area.
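The mechanism amounts to saving one set of the kernel's pointer variables and loading another. A C sketch of the idea (the structure and function names are illustrative only; the actual Flight code follows):

    /* The kernel state that decides where compilation, input, and name
       lookup take place (HERE, HERE_NEXT, SLOT, THERE, INPUT, NAME_END). */
    struct kernel_state {
        int here, here_next, slot, there, input, name_end;
    };

    static struct kernel_state nm_state;  /* native machine state  */
    static struct kernel_state vm_state;  /* virtual machine state */

    /* Retarget the kernel at the VM memory area. */
    void enter_vm(struct kernel_state *current)
    {
        nm_state = *current;   /* save the native state  */
        *current = vm_state;   /* restore the VM state   */
    }

    /* The inverse, corresponding to )VM: return to the native area. */
    void leave_vm(struct kernel_state *current)
    {
        vm_state = *current;   /* save the VM state      */
        *current = nm_state;   /* restore the native state */
    }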
// Virtual Machine state
: vm_here n 1 var ;
: vm_here_next n 1 var ;
: vm_there n 1 var ;
: vm_slot n 1 var ;
: vm_input n 1 var ;
: vm_name_end n 1 var ;
// Native Machine state
: nm_here n 1 var ;
: nm_here_next n 1 var ;
:
:
:
:
nm_there n 1 var ;
nm_slot n 1 var ;
nm_input n 1 var ;
nm_name_end n 1 var ;
:
:
:
:
:
:
:
:
:
:
:
:
(!) ;
;
(!) ;
(!) ;
c save_nm_here
c save_nm_here_next
c save_nm_slot
c save_nm_there
c save_nm_input
c save_nm_name_end
c restore_vm_here
c restore_vm_here_next
c restore_vm_slot
c restore_vm_there
c restore_vm_input
c restore_vm_name_end
c init_name_dict
// Change the main loop
(unwind)
j vm_nxec
// End execution inside VM
: )VM
c save_vm_here
c save_vm_here_next
c save_vm_slot
c save_vm_there
c save_vm_input
c save_vm_name_end
c restore_nm_here
c restore_nm_here_next
c restore_nm_slot
c restore_nm_there
c restore_nm_input
c restore_nm_name_end
// Change the main loop
(unwind)
j NXEC
// Print this to signal we got this far
>$ METACOMP1 $> \n
HERE
HERE_NEXT
THERE
SLOT
INPUT
NAME_END
: SCAN_STRING
l# INPUT (@) (>R) (R@+)
: SS_LOOP
c READ1 (R!+)
-n# 1 (+) (DUP)
if
j SS_LOOP
else
(DROP) (R>) (DUP) (!) ;
:
c
c
j
SCAN
READ1
PUSH_STRING
SCAN_STRING
: FIRST_SLOT
l# SLOT (>A)
n# 31 (A!) ;
: LAST_SLOT
l# SLOT (>A)
n# 1040187392 (A!) ;
: NULL_SLOT
l# SLOT (>A)
n# 0 (A!) ;
: ALIGN
l# HERE_NEXT (@)
(DUP) (>R) n# 0 (R!+)
(R>) (A!)
l# HERE (!)
j NULL_SLOT
: NEXT_SLOT
l# SLOT (@)
n# 0 (OVER) (XOR)
if
n# 1040187392 (OVER) (XOR)
if
(2*) (2*) (2*) (2*) (2*) (A!) ;
else
(DROP) c ALIGN j FIRST_SLOT
else
(DROP) j FIRST_SLOT
136
: DEFN_AS
l# THERE (@) (!)
l# INPUT (@)
-n# 1 (+)
l# THERE (!) ;
: NEW_WORD
c ALIGN
l# HERE (@) ;
: DEFN
c NEW_WORD
j DEFN_AS
: LOOK
l# THERE (@) n# 1 (+) (DUP)
: LOOK_LOOP
l# INPUT (@)
c COMPARE_STRINGS
l# INPUT (@)
c STRING_TAIL (XOR)
if
(DROP) c STRING_TAIL
n# 1 (+)
(DUP) l# NAME_END (@) (XOR)
if
(DUP) j LOOK_LOOP
else
(DROP) l# NAME_END (@) ;
else
(>A) (DROP) (A@) ;
: COMPILE_OPCODE
c NEXT_SLOT
(>R) l# SLOT (@) (NOT)
(R>) (OVER) (OVER) (AND) if
(2*) (2*) (2*) (2*) (2*)
(OVER) (OVER) (AND) if
(2*) (2*) (2*) (2*) (2*)
(OVER) (OVER) (AND) if
(2*) (2*) (2*) (2*) (2*)
(OVER) (OVER) (AND) if
(2*) (2*) (2*) (2*) (2*)
(OVER) (OVER) (AND) if
137
:
c
c
j
EXEC
READ1
EXECUTE
EXEC
:
c
c
c
c
j
NXEC
SCAN
LOOK
POP_STRING
EXECUTE
NXEC
: TENSTAR
(DUP) (2*) (2*) (2*) (OVER) (+) (+) ;
: NUMI
l# INPUT (@)
(>R) (R@+)
n# 0 (>A)
(DUP)
if
: NUMI_LOOP
(A>) c TENSTAR (R@+) -n# 48 (+) (+) (>A)
-n# 1 (+) (DUP)
if
j NUMI_LOOP
else
else
(DROP) (R>) (DROP) (A>)
j POP_STRING
l NXEC l START_WORD !
save_vm_here vm_here @ l HERE !
save_vm_here_next vm_here_next @
save_vm_slot vm_slot @ l SLOT !
save_vm_input vm_input @ l INPUT
save_vm_there vm_there @ l THERE
save_vm_name_end vm_name_end @ l
)VM
>$ FIF1 $> \n
l HERE_NEXT !
!
!
NAME_END !
Appendix B
Static and Dynamic Gullwing Code
Analyses
The following analyses are based on the software developed in Appendix A for the Gullwing
processor. The analyses are imperfect since it cannot always be known if a memory word
contains instructions or a literal. For example, this happens wherever the software contains
directly accessed memory locations. These memory locations are neither executed nor accessed
in-line like literal fetches. The analysis software considers these memory locations to contain
instructions by default, despite actually being literals. Fortunately, such cases are infrequent
and contribute very little noise to the data, showing up as rare, large literal values and UND
instructions.
VM Ratio
3029 1.000
1565 0.517
1464 0.483
593
871
0.405
0.595
Bits
4
8
12
16
20
24
28
32
Bare
56
17
1
2
0
0
0
3
Ratio
0.709
0.215
0.013
0.025
0.000
0.000
0.000
0.038
Ext.
168
22
13
4
2
1
1
4
Ratio
0.781
0.102
0.060
0.019
0.009
0.005
0.005
0.019
VM
340
63
73
101
5
2
2
7
Ratio
0.573
0.106
0.123
0.170
0.008
0.003
0.003
0.012
Bits
4
8
12
16
20
24
28
32
Bare
3
70
7
1
0
2
0
1
Ratio
0.036
0.833
0.083
0.012
0.000
0.024
0.000
0.012
Ext.
22
187
164
2
0
0
2
0
Ratio
0.058
0.496
0.435
0.005
0.000
0.000
0.005
0.000
VM
34
224
516
95
0
0
2
0
Ratio
0.039
0.257
0.592
0.109
0.000
0.000
0.002
0.000
#/Word
0
1
2
3
4
5
6
Bare
0
28
38
3
16
7
31
Ratio
0.000
0.228
0.309
0.024
0.130
0.057
0.252
Ext.
32
242
132
44
34
18
91
Ratio
0.054
0.408
0.223
0.074
0.057
0.030
0.153
VM
126
595
347
140
75
44
238
Ratio
0.081
0.380
0.222
0.089
0.048
0.028
0.152
Bare
1.392
2.580
3.236
398
738
0.539
Ext.
1.190
3.003
2.380
1410
3558
0.396
VM
1.207
3.100
2.337
3657
9390
0.389
Instr.
JMP0
JMP+
CALL
RET
JMP
PC@
LIT
@A
@A+
!A
!A+
@R+
!R+
XOR
AND
NOT
2*
2/
+
+*
DUP
DROP
OVER
>R
R>
>A
A>
NOP
UND0
UND1
UND2
UND3
Bare
14
0
30
20
40
340
79
22
4
19
2
4
2
5
5
2
33
0
16
0
12
9
13
6
5
43
8
1
1
0
0
3
C/I
0.035
0.000
0.075
0.050
0.101
0.854
0.198
0.055
0.010
0.048
0.005
0.010
0.005
0.013
0.013
0.005
0.083
0.000
0.040
0.000
0.030
0.023
0.033
0.015
0.013
0.108
0.020
0.003
0.003
0.000
0.000
0.008
C/S
0.019
0.000
0.041
0.027
0.054
0.461
0.107
0.030
0.005
0.026
0.003
0.005
0.003
0.007
0.007
0.003
0.045
0.000
0.022
0.000
0.016
0.012
0.018
0.008
0.007
0.058
0.011
0.001
0.001
0.000
0.000
0.004
Ext.
33
19
244
141
81
2148
215
54
4
33
5
5
4
19
8
18
49
17
83
18
59
56
41
43
41
93
14
2
0
2
1
8
C/I
0.023
0.013
0.173
0.100
0.057
1.523
0.152
0.038
0.003
0.023
0.004
0.004
0.003
0.013
0.006
0.013
0.035
0.012
0.059
0.013
0.042
0.040
0.029
0.030
0.029
0.066
0.010
0.001
0.000
0.001
0.001
0.006
C/S
0.009
0.005
0.069
0.040
0.023
0.604
0.060
0.015
0.001
0.009
0.001
0.001
0.001
0.005
0.002
0.005
0.014
0.005
0.023
0.005
0.017
0.016
0.012
0.012
0.012
0.026
0.004
0.001
0.000
0.001
0.000
0.002
VM
75
40
566
419
190
5733
593
154
17
120
53
14
17
47
19
43
104
39
184
41
136
123
87
109
101
286
39
8
2
6
8
17
C/I
0.021
0.011
0.155
0.115
0.052
1.568
0.162
0.042
0.005
0.033
0.014
0.004
0.005
0.013
0.005
0.012
0.028
0.011
0.050
0.011
0.037
0.034
0.024
0.030
0.028
0.078
0.011
0.002
0.001
0.002
0.002
0.005
C/S
0.008
0.004
0.060
0.045
0.020
0.611
0.063
0.016
0.002
0.013
0.006
0.001
0.002
0.005
0.002
0.005
0.011
0.004
0.020
0.004
0.014
0.013
0.009
0.012
0.011
0.030
0.004
0.001
0.000
0.001
0.001
0.002
Test            Ext.          VM
Instructions    5,018,751     204,325,372
Cycles          6,574,996     265,567,537
Instruction
JMP0
JMP+
JMP0 TAKEN
JMP+ TAKEN
CALL
RET
JMP
PC@
FOLDS
LIT
@A
@A+
!A
!A+
@R+
!R+
XOR
AND
NOT
2*
2/
+
+*
DUP
DROP
OVER
>R
R>
>A
A>
NOP
UND0
UND1
UND2
UND3
Ext.
304,141
63
23,307
137
320,857
321,997
107,803
230,798
210,080
636,924
321,272
320,744
12,909
428
120,753
6038
319,174
2247
2249
9914
0
515,465
0
212,572
103,643
6777
104,923
103,781
633,291
302,944
0
0
0
0
0
C/I
0.061
0.000
0.005
0.000
0.064
0.064
0.021
0.041
0.042
0.127
0.064
0.064
0.003
0.000
0.024
0.001
0.064
0.000
0.001
0.002
0.000
0.103
0.000
0.042
0.021
0.001
0.021
0.021
0.126
0.060
0.000
0.000
0.000
0.000
0.000
C/C
0.046
0.000
0.007
0.000
0.098
0.098
0.033
0.031
0.032
0.097
0.098
0.098
0.004
0.000
0.037
0.002
0.049
0.000
0.000
0.002
0.000
0.078
0.000
0.032
0.016
0.001
0.016
0.016
0.096
0.046
0.000
0.000
0.000
0.000
0.000
VM
2,280,533
63
83,610
1,875,075
10,143,368
15,306,286
5,617,365
4,428,928
15,381,940
32,273,461
16,753,698
1,546,398
9,326,086
1272
560,414
28,593
4,203,870
7,042,637
3,758,267
32,069
25,802,660
15,671,697
0
10,923,102
511,936
3,770,383
7,110,502
1,947,580
21,482,640
1,843,179
0
0
0
0
0
C/I
0.011
0.000
0.000
0.009
0.050
0.075
0.027
0.022
0.075
0.158
0.082
0.008
0.046
0.000
0.003
0.000
0.021
0.034
0.018
0.000
0.126
0.077
0.000
0.053
0.003
0.018
0.035
0.010
0.105
0.009
0.000
0.000
0.000
0.000
0.000
C/C
0.009
0.000
0.001
0.014
0.076
0.115
0.042
0.017
0.058
0.122
0.126
0.012
0.070
0.000
0.004
0.000
0.016
0.027
0.014
0.000
0.097
0.059
0.000
0.041
0.002
0.014
0.027
0.007
0.081
0.007
0.000
0.000
0.000
0.000
0.000
Test      Ext.     VM
Best      1.305    1.290
Actual    1.310    1.300
Worst     1.371    1.311
Test                                                           Extensions            Virtual Machine
Instr. Type (Members)                                 Cycles   Count        Freq.    Count         Freq.
Conditionals (JMP+, JMP0)                               1        304,204    0.061      2,280,596   0.011
Conditionals, Taken (JMP+ TAKEN, JMP0 TAKEN)            2         23,444    0.005      1,958,685   0.010
Subroutine (CALL, RET, JMP)                             2        750,657    0.150     31,067,019   0.152
Fetches (PC@, LIT)                                      1        840,722    0.168     36,702,389   0.180
Load/Store (@A, @A+, !A, !A+, @R+, !R+)                 2        782,144    0.156     28,216,461   0.138
Arithmetic & Logic (XOR, AND, NOT, 2*, 2/, +, +*)       1        849,649    0.169     56,510,900   0.277
Stack Manipulation (DUP, DROP, OVER, >R, R>, >A, A>)    1      1,467,931    0.292     47,589,322   0.233
NOP/UND (NOP, UND[0-3])                                 1              0    0.000              0   0.000
Instructions
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
16
17
20
Total Blocks
Average
Ext.
Ratio
24,613 0.023
300,464 0.279
394
0.000
230,485 0.214
208,813 0.194
103,499 0.096
3292
0.003
100,262 0.093
100,824 0.094
3108
0.003
77
0.000
43
0.000
61
0.000
634
0.001
1728
0.002
8
0.000
1,078,305
3.654
VM
Ratio
17,143,651 0.486
2,372,695 0.067
661,029
0.019
2,118,717 0.060
940,229
0.027
741,651
0.021
1,099,523 0.031
719,878
0.020
394,367
0.011
647,462
0.018
317,034
0.009
322,764
0.009
132,018
0.004
646,811
0.018
1,884,477 0.053
3462
0.000
5,160,532 0.146
35,306,300
4.787
Ext.
Ratio
109,303 0.022
876,384 0.175
1,379,820 0.275
1,437,394 0.286
879,234 0.175
272,451 0.054
52,157
0.010
9314
0.002
2184
0.000
463
0.000
47
0.000
2.636
VM
1,469,323
13,429,593
29,249,154
44,388,630
44,484,643
34,373,299
20,746,317
9,725,059
4,301,403
1,710,608
367,851
65,179
11,231
2677
405
3.924
Ratio
0.007
0.066
0.143
0.217
0.218
0.168
0.102
0.048
0.021
0.008
0.002
0.000
0.000
0.000
0.000
Depth
0
1
2
3
4
5
6
7
8
9
10
11
Average
Ext.
Ratio
5666
0.001
1,862,081 0.371
1,452,360 0.289
1,147,591 0.229
379,797 0.076
169,104 0.034
1978
0.000
174
0.000
2.110
VM
19,823
8,362,538
15,583,345
57,244,617
48,202,927
44,737,258
17,140,510
10,796,692
1,452,808
778,445
6079
330
4.037
Ratio
0.000
0.041
0.076
0.280
0.236
0.219
0.084
0.053
0.007
0.004
0.000
0.000
Bibliography
[ABB64]
[All85]
[Bai94]
Chris Bailey, HLL enhancement for stack based processors, EuroMicro Journal
of Microprocessing and Microprogramming 40 (1994), 665–668.
[Bai96]
, Optimisation Techniques for Stack-Based Processors, PhD thesis, University of Teesside, UK, July 1996, Amongst other things, analyzes the use of
stack buffers to reduce memory traffic.
[Bai00]
[Bai04]
[Bar61a]
[Bar61b]
[Bar61c]
[Bar87]
[Bau60]
[Bau90]
F. L. Bauer, The cellar principle of state transition and storage allocation, IEEE
Ann. Hist. Comput. 12 (1990), no. 1, 41–49.
[Bau02]
Friedrich L. Bauer, From the Stack Principle to ALGOL, pp. 26–42, in Broy and
Denert [BD02], 2002, Points to possible earlier origins of stacks for computation.
[BBG+ 60]
[BBRS58]
[BFPB97]
[BI86]
Leo B. Brodie and FORTH Inc., Starting FORTH, second ed., Prentice-Hall, Inc.,
Upper Saddle River, NJ, USA, 1986.
[Bla77]
[Bla90]
[Bro84]
[BSa]
Friedrich Ludwig Bauer and Klaus Samelson, French patent 1.204.424: Machine
à calculer automatique et procédé pour son exploitation, Filed March 28, 1958.
Delivered August 10, 1959. Published January 26, 1960.
[BSb]
[BSc]
[BSd]
[BS94]
C. Bailey and R. Sotudeh, HLL enhancement for stack based processors, Selected
papers of the short notes session on Euromicro 94 (Amsterdam, The Netherlands),
Elsevier Science Publishers B. V., 1994, pp. 685–688.
[Bul77]
[Bur63]
[Bur67]
[Bur73]
[Bur81]
[Bur98]
[Car58]
[Cha95]
Robert James Chapman, Stack quarks, Proc. Rochester Forth Conference on Emerging Technology (Rochester, New York) (Lawrence
P. G. Forsley, ed.), University of Rochester, The Institute for Applied
Forth Research, Inc., June 1995, Decomposes the typical stack permutation operations into smaller primitives. Online as of Nov. 2006:
http://www.compusmart.ab.ca/rc/Papers/StackQuarksPaper.pdf.
[Cha97]
[Cha98]
Rob Chapman, A Stack Processor: Synthesis, web page, January 1998, Project Report for EE602. Online as of Nov. 2006:
http://www.compusmart.ab.ca/rc/Papers/spsynthesis.pdf.
[Chu75]
[DBL00]
[Dor75a]
[Dor75b]
[DP80]
[DP86]
[DP98a]
[DP98b]
[Dun77]
Fraser George Duncan, Stack machine development: Australia, Great Britain, and
Europe, Computer 10 (1977), no. 5, 50–52.
[Eng63]
English
Electric-LEO
Computers
Ltd.,
Kidsgrove,
Stoke-OnTrent, Staffordshire, England, KDF9 programming manual, circa
1963, Online as of March 2006 at:
http://www.jeays.ca/kdf9.html
and
http://frink.ucg.ie/~bfoley/edhist/kdf9pm/kdf9pm.html
and
http://acms.synonet.com/deuce/KDF9pm.pdf.
[FL86]
[Fox98]
1998,
[Fox04]
2004,
[FR93]
[Fre98]
Paul Frenger, Forth in space, or, so NEAR yet so far out, SIGPLAN Not. 33
(1998), no. 6, 24–26.
[Fre01]
[GL03]
William F. Gilreath and Phillip A. Laplante, Computer Architecture, Kluwer Academic Publishers, 2003.
[GWS91]
George William Shaw II, Sh-BOOM: the sound of the RISC market changing,
Proceedings of the second and third annual workshops on Forth, ACM Press,
1991, p. 125.
[Hal62]
A. C. D. Haley, The KDF.9 computer system, Proceedings of the AFIPS Fall Joint
Computer Conference, vol. 21, 1962, pp. 108–120.
[Ham57a]
[Ham57b]
[Ham85]
[Hay97]
[Hen84]
[Hen86]
[Hew84]
[HH00]
[HJB+ 82]
John Hennessy, Norman Jouppi, Forest Baskett, Thomas Gross, and John Gill,
Hardware/software tradeoffs for increased performance, ASPLOS-I: Proceedings
of the first international symposium on Architectural support for programming
languages and operating systems (New York, NY, USA), ACM Press, 1982,
pp. 2–11.
[HJP+ 82]
[HP96]
[HP02]
[HVZ95]
[Hwa92]
[IC78]
[Int00]
Intersil, Data sheet for HS-RTX2010RH, March 2000, File Number 3961.3.
[Int06]
[Kat85]
[KDF61]
KDF9: Very high speed data processing system for commerce, industry, science,
English Electric, Kidsgrove, Stoke-On-Trent, Staffordshire, England, 1961, Sales
brochure for the KDF9.
[KKC92a]
[KKC92b]
[Kog90]
[Koo89]
Philip J. Koopman, Stack computers: the new wave, Halsted Press, 1989, A compendium of stack computer architectures. Has useful experimental data.
[Koo90]
, Modern stack computer architecture, System Design and Network Architecture Conference (1990), 153–164.
[Koo91]
[Koo94]
[Lav80]
Simon Hugh Lavington, Early British computers: The story of vintage computers
and the people who built them, Butterworth-Heinemann, Newton, MA, USA,
1980.
[LTL98]
[Luk29]
Jan Lukasiewicz, Elements of mathematical logic, Warsaw, 1929, [English translation of 1958 edition: Macmillan, 1963].
[McK80]
[McL93]
Edward McLellan, The Alpha AXP architecture and 21064 processor, IEEE Micro 13 (1993), no. 3, 3647.
[ME97]
[ME98]
[MK97]
M. Morris Mano and Charles R. Kime, Logic and computer design fundamentals,
Prentice-Hall, Inc., 1997.
[ML70]
[Moo91]
Charles H. Moore, Forth - the early years, Unpublished notes that became
the papers by Rather, Colburn, and Moore [RCM93] [RCM96]. Accessible at
http://www.colorforth.com/HOPL.html as of Nov. 2006., 1991.
[Moo01a]
[Moo01b]
[MP95]
[MT95]
[Mur86]
Robert W. Murphy, Under the hood of a superchip: the NOVIX Forth engine, J.
FORTH Appl. Res. 3 (1986), no. 2, 185–188.
[Mur90]
[Omo94]
[Omo99]
[Org73]
[Pat85]
[Pat86]
[Pay96]
[Pay02]
[PD80]
David A. Patterson and David R. Ditzel, The case for the reduced instruction set
computer, SIGARCH Comput. Archit. News 8 (1980), no. 6, 25–33.
[PH90]
[PH98]
[PS81]
David A. Patterson and Carlo H. Sequin, RISC I: A reduced instruction set VLSI
computer, ISCA '81: Proceedings of the 8th annual symposium on Computer
Architecture (Los Alamitos, CA, USA), IEEE Computer Society Press, 1981,
pp. 443–457.
[PS98a]
[PS98b]
David A. Patterson and Carlo H. Sequin, RISC I: a reduced instruction set VLSI
computer, ISCA '98: 25 years of the international symposia on Computer architecture (selected papers) (New York, NY, USA), ACM Press, 1998, pp. 216–230.
[Ras03]
James Rash, Space-related applications of forth, webpage:
http://forth.gsfc.nasa.gov/, April 2003, Presents space-related applications
of Forth microprocessors and the Forth programming language at NASA.
Accessed on Nov. 2006.
[RCM93]
[RCM96]
[Ros87]
Robert F. Rosin, Prologue: The Burroughs B5000, Annals of the History of Computing 9 (1987), no. 1, 6–7.
[SB60]
[SB04]
Huibin Shi and Chris Bailey, Investigating available instruction level parallelism
for stack based machine architectures, DSD '04: Proceedings of the Digital System Design, EUROMICRO Systems on (DSD'04) (Washington, DC, USA), IEEE
Computer Society, 2004, pp. 112–120.
[Sha99]
George William Shaw, PSC1000 microprocessor reference manual, Patriot Scientific Corporation, San Diego, CA, March 1999, Ref. No. 99-0370001.
[Sha02]
, IGNITE intellectual property reference manual, Patriot Scientific Corporation, San Diego, CA, March 2002, Revision 1.0.
[Sit78]
[SSK97]
[Sta90]
William Stallings, Computer organization and architecture, 2nd ed., Prentice Hall
PTR, 1990.
[Sta93]
[Sta02]
[Sto80]
[Sto92]
[TCCLL99] P.K. Tsang, K.H. C.C. Cheung, T.K. Lee Leung, and P.H.W. Leong,
MSL16A: an asynchronous Forth microprocessor, TENCON 99. Proceedings of
the IEEE Region 10 Conference, vol. 2, September 1999, pp. 1079–1082.
[Tin97a]
[Tin97b]
[Wil91]
[Wil96]
[Wil01]
[Yue]