Commit d1fcd33

committed
Add new documentation on page format.
Martijn van Ooster
1 parent 42ef2c9 commit d1fcd33

File tree

1 file changed

+234
-88
lines changed

doc/src/sgml/page.sgml

Lines changed: 234 additions & 88 deletions
@@ -22,9 +22,13 @@ refers to data that is stored in <productname>PostgreSQL</productname> tables.
2222
</para>
2323

2424
<para>
25-
<xref linkend="page-table"> shows how pages in both normal <productname>PostgreSQL</productname> tables
26-
and <productname>PostgreSQL</productname> indexes
27-
(e.g., a B-tree index) are structured.
25+
26+
<xref linkend="page-table"> shows how pages in both normal
27+
<productname>PostgreSQL</productname> tables and
28+
<productname>PostgreSQL</productname> indexes (e.g., a B-tree index)
29+
are structured. This structure is also used for toast tables and sequences.
30+
There are five parts to each page.
31+
2832
</para>
2933

3034
<table tocentry="1" id="page-table">
@@ -43,113 +47,255 @@ Item
4347
<tbody>
4448

4549
<row>
46-
<entry>itemPointerData</entry>
47-
</row>
48-
49-
<row>
50-
<entry>filler</entry>
50+
<entry>PageHeaderData</entry>
51+
<entry>20 bytes long. Contains general information about the page that is needed to access it.</entry>
5152
</row>
5253

5354
<row>
54-
<entry>itemData...</entry>
55+
<entry>itemPointerData</entry>
56+
<entry>List of (offset,length) pairs pointing to the actual items.</entry>
5557
</row>
5658

5759
<row>
58-
<entry>Unallocated Space</entry>
60+
<entry>Free space</entry>
61+
<entry>The unallocated space. All new tuples are allocated from here, generally from the end.</entry>
5962
</row>
6063

6164
<row>
62-
<entry>ItemContinuationData</entry>
65+
<entry>items</entry>
66+
<entry>The actual items themselves. Different access methods store different data here.</entry>
6367
</row>
6468

6569
<row>
6670
<entry>Special Space</entry>
71+
<entry>Access-method-specific data. Different access methods store different data. Unused by normal tables.</entry>
6772
</row>
6873

69-
<row>
70-
<entry><quote>ItemData 2</quote></entry>
71-
</row>
74+
</tbody>
75+
</tgroup>
76+
</table>
7277

73-
<row>
74-
<entry><quote>ItemData 1</quote></entry>
75-
</row>
78+
<para>
7679

77-
<row>
78-
<entry>ItemIdData</entry>
79-
</row>
80+
The first 20 bytes of each page consist of a page header
81+
(PageHeaderData). Its format is detailed in <xref
82+
linkend="pageheaderdata-table">. The first two fields hold
83+
WAL-related information. They are followed by three 2-byte integer fields
84+
(<firstterm>lower</firstterm>, <firstterm>upper</firstterm>, and
85+
<firstterm>special</firstterm>). These represent byte offsets to the start
86+
of unallocated space, to the end of unallocated space, and to the start of
87+
the special space.
88+
89+
</para>
90+
91+
<table tocentry="1" id="pageheaderdata-table">
92+
<title>PageHeaderData Layout</title>
93+
<titleabbrev>PageHeaderData Layout</titleabbrev>
94+
<tgroup cols="4">
95+
<thead>
96+
<row>
97+
<entry>Field</entry>
98+
<entry>Type</entry>
99+
<entry>Length</entry>
100+
<entry>Description</entry>
101+
</row>
102+
</thead>
103+
<tbody>
104+
<row>
105+
<entry>pd_lsn</entry>
106+
<entry>XLogRecPtr</entry>
107+
<entry>8 bytes</entry>
108+
<entry>LSN: next byte after last byte of xlog record for last change to this page</entry>
109+
</row>
110+
<row>
111+
<entry>pd_sui</entry>
112+
<entry>StartUpID</entry>
113+
<entry>4 bytes</entry>
114+
<entry>SUI of last changes (currently it's used by heap AM only)</entry>
115+
</row>
116+
<row>
117+
<entry>pd_lower</entry>
118+
<entry>LocationIndex</entry>
119+
<entry>2 bytes</entry>
120+
<entry>Offset to start of free space.</entry>
121+
</row>
122+
<row>
123+
<entry>pd_upper</entry>
124+
<entry>LocationIndex</entry>
125+
<entry>2 bytes</entry>
126+
<entry>Offset to end of free space.</entry>
127+
</row>
128+
<row>
129+
<entry>pd_special</entry>
130+
<entry>LocationIndex</entry>
131+
<entry>2 bytes</entry>
132+
<entry>Offset to start of special space.</entry>
133+
</row>
134+
<row>
135+
<entry>pd_opaque</entry>
136+
<entry>OpaqueData</entry>
137+
<entry>2 bytes</entry>
138+
<entry>AM-generic information. Currently just stores the page size.</entry>
139+
</row>
140+
</tbody>
141+
</tgroup>
142+
</table>
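As an illustration of the header layout in the table above, here is a minimal Python sketch that decodes the first 20 bytes of a page. It rests on several assumptions that are not stated in the text: that pd_lsn occupies 8 bytes (two 32-bit halves, so that the fields add up to the 20-byte header), that the fields are stored little-endian without padding, and that the caller already knows the page size. The function names are hypothetical and are not part of PostgreSQL.

```python
import struct

# Assumed layout: pd_lsn (2 x uint32), pd_sui (uint32),
# pd_lower, pd_upper, pd_special, pd_opaque (uint16 each).
# "<" selects little-endian with no padding; real byte order is
# platform dependent, so treat this as an assumption.
PAGE_HEADER = struct.Struct("<IIIHHHH")  # 20 bytes in total


def parse_page_header(page: bytes) -> dict:
    """Decode the page header fields from the start of a raw page."""
    lsn_hi, lsn_lo, sui, lower, upper, special, opaque = PAGE_HEADER.unpack_from(page, 0)
    return {
        "pd_lsn": (lsn_hi, lsn_lo),
        "pd_sui": sui,
        "pd_lower": lower,      # offset to start of free space
        "pd_upper": upper,      # offset to end of free space
        "pd_special": special,  # offset to start of special space
        "pd_opaque": opaque,    # left undecoded; its encoding is not described here
    }


def has_special_space(header: dict, page_size: int) -> bool:
    """Ordinary tables set pd_special to the page size, meaning no special space."""
    return header["pd_special"] < page_size
```

For a heap page one would expect `has_special_space` to return `False`, while an index page would normally reserve some bytes at the end.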
80143

81-
<row>
82-
<entry>PageHeaderData</entry>
83-
</row>
144+
<para>
145+
Special space is a region at the end of the page that is allocated at page
146+
initialization time and contains information specific to an access method.
147+
The last 2 bytes of the page header, <firstterm>opaque</firstterm>,
148+
currently store only the page size. Page size is stored in each page
149+
because frames in the buffer pool may be subdivided into equal-sized pages
150+
on a frame by frame basis within a table (is this true? - mvo).
84151

85-
</tbody>
86-
</tgroup>
87-
</table>
152+
</para>
88153

89-
<!--
90-
.\" Running
91-
.\" .q .../bin/dumpbpages
92-
.\" or
93-
.\" .q .../src/support/dumpbpages
94-
.\" as the postgres superuser
95-
.\" with the file paths associated with
96-
.\" (heap or B-tree index) classes,
97-
.\" .q .../data/base/<database-name>/<class-name>,
98-
.\" will display the page structure used by the classes.
99-
.\" Specifying the
100-
.\" .q -r
101-
.\" flag will cause the classes to be
102-
.\" treated as heap classes and for more information to be displayed.
103-
-->
154+
<para>
104155

105-
<para>
106-
The first 8 bytes of each page consists of a page header
107-
(PageHeaderData).
108-
Within the header, the first three 2-byte integer fields
109-
(<firstterm>lower</firstterm>,
110-
<firstterm>upper</firstterm>,
111-
and
112-
<firstterm>special</firstterm>)
113-
represent byte offsets to the start of unallocated space, to the end
114-
of unallocated space, and to the start of <firstterm>special space</firstterm>.
115-
Special space is a region at the end of the page that is allocated at
116-
page initialization time and contains information specific to an
117-
access method. The last 2 bytes of the page header,
118-
<firstterm>opaque</firstterm>,
119-
encode the page size and information on the internal fragmentation of
120-
the page. Page size is stored in each page because frames in the
121-
buffer pool may be subdivided into equal sized pages on a frame by
122-
frame basis within a table. The internal fragmentation information is
123-
used to aid in determining when page reorganization should occur.
124-
</para>
156+
Following the page header are item identifiers
157+
(<firstterm>ItemIdData</firstterm>). New item identifiers are allocated
158+
from the first four bytes of unallocated space. Because an item
159+
identifier is never moved until it is freed, its index may be used to
160+
indicate the location of an item on a page. In fact, every pointer to an
161+
item (<firstterm>ItemPointer</firstterm>, also known as
162+
<firstterm>CTID</firstterm>) created by
163+
<productname>PostgreSQL</productname> consists of a frame number and an
164+
index of an item identifier. An item identifier contains a byte-offset to
165+
the start of an item, its length in bytes, and a set of attribute bits
166+
which affect its interpretation.
125167

126-
<para>
127-
Following the page header are item identifiers
128-
(<firstterm>ItemIdData</firstterm>).
129-
New item identifiers are allocated from the first four bytes of
130-
unallocated space. Because an item identifier is never moved until it
131-
is freed, its index may be used to indicate the location of an item on
132-
a page. In fact, every pointer to an item
133-
(<firstterm>ItemPointer</firstterm>)
134-
created by <productname>PostgreSQL</productname> consists of a frame number and an index of an item
135-
identifier. An item identifier contains a byte-offset to the start of
136-
an item, its length in bytes, and a set of attribute bits which affect
137-
its interpretation.
138-
</para>
168+
</para>
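The arithmetic implied by the paragraph above can be sketched in Python. The 20-byte header and the 4-byte item identifier come from the text; the 1-based item index and the helper names are assumptions made for illustration, and the fit check ignores any alignment padding a real implementation may add.

```python
PAGE_HEADER_SIZE = 20  # size of PageHeaderData, from the text
ITEM_ID_SIZE = 4       # each item identifier takes four bytes


def item_count(pd_lower: int) -> int:
    """Number of item identifiers between the header and the free space."""
    return (pd_lower - PAGE_HEADER_SIZE) // ITEM_ID_SIZE


def item_id_offset(index: int) -> int:
    """Byte offset of an item identifier; index is assumed to start at 1."""
    return PAGE_HEADER_SIZE + (index - 1) * ITEM_ID_SIZE


def free_space(pd_lower: int, pd_upper: int) -> int:
    """Bytes of unallocated space between pd_lower and pd_upper."""
    return pd_upper - pd_lower


def can_fit(pd_lower: int, pd_upper: int, item_len: int) -> bool:
    """A new item needs room for its data plus one new item identifier."""
    return item_len + ITEM_ID_SIZE <= free_space(pd_lower, pd_upper)
```

Because an identifier stays at the same offset until it is freed, the pair (frame number, index) remains a stable address for the item even if the item data itself is moved within the page.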
139169

140-
<para>
141-
The items themselves are stored in space allocated backwards from
142-
the end of unallocated space. Usually, the items are not interpreted.
143-
However when the item is too long to be placed on a single page or
144-
when fragmentation of the item is desired, the item is divided and
145-
each piece is handled as distinct items in the following manner. The
146-
first through the next to last piece are placed in an item
147-
continuation structure
148-
(<firstterm>ItemContinuationData</firstterm>).
149-
This structure contains
150-
itemPointerData
151-
which points to the next piece and the piece itself. The last piece
152-
is handled normally.
153-
</para>
170+
<para>
171+
172+
The items themselves are stored in space allocated backwards from the end
173+
of unallocated space. The exact structure varies depending on what the
174+
table is to contain. Sequences and tables both use a structure named
175+
<firstterm>HeapTupleHeaderData</firstterm>, described below.
176+
177+
</para>
178+
179+
<para>
180+
181+
The final section is the "special section" which may contain anything the
182+
access method wishes to store. Ordinary tables do not use this at all
183+
(indicated by setting the offset to the page size).
184+
185+
</para>
186+
187+
<para>
188+
189+
All tuples are structured the same way: a header of around 31 bytes,
190+
followed by an optional null bitmask and the data. The header is detailed
191+
below in <xref linkend="heaptupleheaderdata-table">. The null bitmask is
192+
only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in the
193+
<firstterm>t_infomask</firstterm>. If it is present it takes up the space
194+
between the end of the header and the beginning of the data, as indicated
195+
by the <firstterm>t_hoff</firstterm> field. In this list of bits, a 1 bit
196+
indicates not-null, a 0 bit is a null.
197+
198+
</para>
199+
200+
<table tocentry="1" id="heaptupleheaderdata-table">
201+
<title>HeapTupleHeaderData Layout</title>
202+
<titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
203+
<tgroup cols="4">
204+
<thead>
205+
<row>
206+
<entry>Field</entry>
207+
<entry>Type</entry>
208+
<entry>Length</entry>
209+
<entry>Description</entry>
210+
</row>
211+
</thead>
212+
<tbody>
213+
<row>
214+
<entry>t_oid</entry>
215+
<entry>Oid</entry>
216+
<entry>4 bytes</entry>
217+
<entry>OID of this tuple</entry>
218+
</row>
219+
<row>
220+
<entry>t_cmin</entry>
221+
<entry>CommandId</entry>
222+
<entry>4 bytes</entry>
223+
<entry>insert CID stamp</entry>
224+
</row>
225+
<row>
226+
<entry>t_cmax</entry>
227+
<entry>CommandId</entry>
228+
<entry>4 bytes</entry>
229+
<entry>delete CID stamp</entry>
230+
</row>
231+
<row>
232+
<entry>t_xmin</entry>
233+
<entry>TransactionId</entry>
234+
<entry>4 bytes</entry>
235+
<entry>insert XID stamp</entry>
236+
</row>
237+
<row>
238+
<entry>t_xmax</entry>
239+
<entry>TransactionId</entry>
240+
<entry>4 bytes</entry>
241+
<entry>delete XID stamp</entry>
242+
</row>
243+
<row>
244+
<entry>t_ctid</entry>
245+
<entry>ItemPointerData</entry>
246+
<entry>6 bytes</entry>
247+
<entry>current TID of this or newer tuple</entry>
248+
</row>
249+
<row>
250+
<entry>t_natts</entry>
251+
<entry>int16</entry>
252+
<entry>2 bytes</entry>
253+
<entry>number of attributes</entry>
254+
</row>
255+
<row>
256+
<entry>t_infomask</entry>
257+
<entry>uint16</entry>
258+
<entry>2 bytes</entry>
259+
<entry>Various flags</entry>
260+
</row>
261+
<row>
262+
<entry>t_hoff</entry>
263+
<entry>uint8</entry>
264+
<entry>1 byte</entry>
265+
<entry>length of tuple header. Also offset of data.</entry>
266+
</row>
267+
</tbody>
268+
</tgroup>
269+
</table>
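To make the tuple header table concrete, the following Python sketch unpacks the nine fields and consults the null bitmask. Several details are assumptions rather than facts from the text: little-endian byte order with no padding between fields (which gives exactly 31 bytes), the split of t_ctid into a 32-bit block number stored as two 16-bit halves plus a 16-bit item index, the value 0x0001 for HEAP_HASNULL, the bit ordering within each bitmask byte, and zero-based attribute numbering. The function names are hypothetical.

```python
import struct

# t_oid, t_cmin, t_cmax, t_xmin, t_xmax (uint32 each), t_ctid (3 x uint16),
# t_natts (int16), t_infomask (uint16), t_hoff (uint8); packed, 31 bytes.
TUPLE_HEADER = struct.Struct("<IIIIIHHHhHB")
HEAP_HASNULL = 0x0001  # assumed flag value, not given in the text


def parse_tuple_header(item: bytes) -> dict:
    """Decode the fixed part of a heap tuple header."""
    (t_oid, t_cmin, t_cmax, t_xmin, t_xmax,
     blk_hi, blk_lo, posid, t_natts, t_infomask, t_hoff) = TUPLE_HEADER.unpack_from(item, 0)
    return {
        "t_oid": t_oid, "t_cmin": t_cmin, "t_cmax": t_cmax,
        "t_xmin": t_xmin, "t_xmax": t_xmax,
        "t_ctid": ((blk_hi << 16) | blk_lo, posid),  # (block number, item index)
        "t_natts": t_natts, "t_infomask": t_infomask, "t_hoff": t_hoff,
    }


def attribute_is_null(item: bytes, header: dict, attnum: int) -> bool:
    """Check the null bitmask: a 1 bit means not-null, a 0 bit means null."""
    if not header["t_infomask"] & HEAP_HASNULL:
        return False  # no bitmask present, so nothing is null
    bitmap = item[TUPLE_HEADER.size:header["t_hoff"]]
    bit = (bitmap[attnum // 8] >> (attnum % 8)) & 1
    return bit == 0
```

The attribute data itself begins at offset `t_hoff` within the item, immediately after the bitmask when one is present.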
270+
271+
<para>
272+
273+
All the details may be found in src/include/storage/bufpage.h.
274+
275+
</para>
276+
277+
<para>
278+
279+
Interpreting the actual data can only be done with information obtained
280+
from other tables, mostly <firstterm>pg_attribute</firstterm>. The
281+
particular fields are <firstterm>attlen</firstterm> and
282+
<firstterm>attalign</firstterm>. There is no way to directly get a
283+
particular attribute, except when there are only fixed width fields and no
284+
NULLs. All this trickery is wrapped up in the functions
285+
<firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
286+
and <firstterm>heap_getsysattr</firstterm>.
287+
288+
</para>
289+
<para>
154290

291+
To read the data you need to examine each attribute in turn. First check
292+
whether the field is NULL according to the null bitmap. If it is, go to
293+
the next. Then make sure you have the right alignment. If the field is a
294+
fixed width field, then all the bytes are simply placed. If it's a
295+
variable length field (attlen == -1) then it's a bit more complicated,
296+
using the variable length structure <firstterm>varattrib</firstterm>.
297+
Depending on the flags, the data may be either inline, compressed or in
298+
another table (TOAST).
299+
300+
</para>
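The walk described above can be sketched for the fixed-width case in Python. This is an illustration under stated assumptions, not the logic of heap_getattr: the attalign codes are taken to mean 1, 2, 4 and 8 byte boundaries, offsets are measured from the start of the tuple data (assumed to be maximally aligned), and variable-length fields are rejected because finding their length requires reading the varattrib structure. The function names are hypothetical.

```python
# Assumed meaning of pg_attribute.attalign codes (char, short, int, double).
ALIGN_BYTES = {"c": 1, "s": 2, "i": 4, "d": 8}


def align_up(offset: int, boundary: int) -> int:
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary


def attribute_offsets(attrs, nulls):
    """Return the data offset of each attribute, or None where it is NULL.

    attrs is a list of (attlen, attalign) pairs in column order;
    nulls is a parallel list of booleans taken from the null bitmap.
    """
    offsets = []
    pos = 0
    for (attlen, attalign), is_null in zip(attrs, nulls):
        if is_null:
            offsets.append(None)  # NULL fields occupy no space in the data
            continue
        if attlen < 0:
            raise ValueError("variable-length field: length must be read from the data")
        pos = align_up(pos, ALIGN_BYTES[attalign])
        offsets.append(pos)
        pos += attlen
    return offsets
```

For example, columns of widths 4, 1, 8 and 2 with alignments `i`, `c`, `d` and `s` land at offsets 0, 4, 8 and 16 when none is NULL. The positions depend on which earlier fields are NULL, which is why an attribute cannot be located directly in the general case.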
155301
</chapter>
