
Commit 7c360d6

Jan Wieck authored and committed
Added documentation for the new interface between the buffer manager
and the cache replacement strategy as well as a description of the ARC algorithm and the special tailoring of that done for PostgreSQL. Jan
1 parent 320a138 commit 7c360d6

1 file changed: src/backend/storage/buffer/README (153 additions, 1 deletion)
@@ -1,4 +1,4 @@
-$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.4 2003/10/31 22:48:08 tgl Exp $
+$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.5 2003/11/14 04:32:11 wieck Exp $

Notes about shared buffer access rules
--------------------------------------
@@ -95,3 +95,155 @@ concurrent VACUUM. The current implementation only supports a single
waiter for pin-count-1 on any particular shared buffer. This is enough
for VACUUM's use, since we don't allow multiple VACUUMs concurrently on a
single relation anyway.


Buffer replacement strategy interface:

The two files freelist.c and buf_table.c contain the buffer cache
replacement strategy. The interface to the strategy is:

BufferDesc *
StrategyBufferLookup(BufferTag *tagPtr, bool recheck)

	This is always the first call made by the buffer manager
	to check if a disk page is in memory. If so, the function
	returns the buffer descriptor and no further action is
	required.

	If the page is not in memory, StrategyBufferLookup()
	returns NULL.

	The flag recheck tells the strategy that this is a second
	lookup after flushing a dirty block. If the buffer manager
	has to evict another buffer, it will release the bufmgr lock
	while doing the write IO. During this time, another backend
	could possibly fault in the same page this backend is after,
	so we have to check again after the IO is done if the page
	is in memory now.

BufferDesc *
StrategyGetBuffer(void)

	The buffer manager calls this function to get an unpinned
	cache buffer whose content can be evicted. The returned
	buffer might be empty, clean or dirty.

	The returned buffer is only a candidate for replacement.
	It is possible that while the buffer is written, another
	backend finds and modifies it, so that it is dirty again.
	The buffer manager will then call StrategyGetBuffer()
	again to ask for another candidate.

void
StrategyReplaceBuffer(BufferDesc *buf, Relation rnode,
                      BlockNumber blockNum)

	Called by the buffer manager at the time it is about to
	change the association of a buffer with a disk page.

	Before this call, StrategyBufferLookup() still has to find
	the buffer even if it was returned by StrategyGetBuffer()
	as a candidate for replacement.

	After this call, this buffer must be returned for a
	lookup of the new page identified by rnode and blockNum.

void
StrategyInvalidateBuffer(BufferDesc *buf)

	Called from various places to inform the strategy that the
	content of this buffer has been thrown away. This happens
	for example in the case of dropping a relation.

	The buffer must be clean and unpinned on call.

	If the buffer is associated with a disk page, StrategyBufferLookup()
	must not return it for this page after the call.

void
StrategyHintVacuum(bool vacuum_active)

	Because vacuum reads all relations of the entire database
	through the buffer manager, it can greatly disturb the
	buffer replacement strategy. This function is used by vacuum
	to inform the strategy that all subsequent buffer lookups are
	caused by vacuum scanning relations.

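The call sequence the interface implies can be sketched as a toy model. This is not the real bufmgr.c/freelist.c code; the structs, array-based lookup and round-robin candidate selection below are simplified stand-ins invented for illustration, showing only the lookup / get-candidate / replace protocol described above:

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-ins: the real BufferTag/BufferDesc in buf_internals.h
 * carry more fields (flags, refcount, IO state, ...). */
typedef struct { unsigned rnode; unsigned blockNum; } BufferTag;
typedef struct { BufferTag tag; int valid; } BufferDesc;

#define NBUFFERS 4
static BufferDesc buffers[NBUFFERS];

/* StrategyBufferLookup(): return the descriptor caching this tag, or NULL. */
static BufferDesc *strategy_buffer_lookup(BufferTag *tagPtr)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (buffers[i].valid &&
            memcmp(&buffers[i].tag, tagPtr, sizeof(BufferTag)) == 0)
            return &buffers[i];
    return NULL;
}

/* StrategyGetBuffer(): hand back a replacement candidate.  Here a
 * simple round-robin; the real strategy picks by ARC. */
static BufferDesc *strategy_get_buffer(void)
{
    static int next = 0;
    BufferDesc *buf = &buffers[next];
    next = (next + 1) % NBUFFERS;
    return buf;
}

/* StrategyReplaceBuffer(): re-associate the buffer with a new page. */
static void strategy_replace_buffer(BufferDesc *buf, unsigned rnode,
                                    unsigned blockNum)
{
    buf->tag.rnode = rnode;
    buf->tag.blockNum = blockNum;
    buf->valid = 1;
}

/* The sequence the README describes: lookup; on a miss get a
 * candidate, re-associate it, and subsequent lookups find the
 * new page. */
static BufferDesc *read_page(unsigned rnode, unsigned blockNum)
{
    BufferTag tag = { rnode, blockNum };
    BufferDesc *buf = strategy_buffer_lookup(&tag);
    if (buf != NULL)
        return buf;                      /* cache hit */
    buf = strategy_get_buffer();         /* candidate for eviction */
    strategy_replace_buffer(buf, rnode, blockNum);
    return buf;
}
```

The recheck path is elided here: in the real code the bufmgr lock is dropped during the write IO, so a second StrategyBufferLookup() with recheck = true is needed before the replacement is committed.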

Buffer replacement strategy:

The buffer replacement strategy actually used in freelist.c is a
version of the Adaptive Replacement Cache (ARC) specially tailored
for PostgreSQL.

The algorithm works as follows:

C is the size of the cache in number of pages (conf: shared_buffers).
ARC uses 2*C Cache Directory Blocks (CDB). A cache directory block
is always associated with one unique file page and "can" point to
one shared buffer.

All file pages known to the directory are managed in 4 LRU lists
named B1, T1, T2 and B2. The T1 and T2 lists are the "real" cache
entries, linking a file page to a memory buffer where the page is
currently cached. Consequently T1len+T2len <= C. B1 and B2 are
ghost cache directories that extend T1 and T2 so that the strategy
remembers pages longer. The strategy tries to keep B1len+T1len and
B2len+T2len both at C. T1len and T2len vary over the runtime
depending on the lookup pattern and the resulting cache hits. The
desired size of T1len is called T1target.

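The directory layout and its invariants can be written down as a small toy model. The type and field names below are illustrative (the real freelist.c uses different structures); it only encodes what the paragraph states: 2*C directory blocks split over four lists, with T1len+T2len bounded by C:

```c
/* Illustrative only: the four ARC lists over 2*C cache directory blocks. */
enum arc_list { ARC_T1, ARC_T2, ARC_B1, ARC_B2 };

#define ARC_C    8               /* cache size in pages (shared_buffers) */
#define ARC_NCDB (2 * ARC_C)     /* ARC keeps 2*C directory blocks */

typedef struct {
    enum arc_list list;          /* which LRU list this CDB is on */
    int has_buffer;              /* only T1/T2 entries point to a buffer */
} CDB;

/* Check the invariants from the text: T1len+T2len <= C, and every one
 * of the 2*C directory blocks is on exactly one of the four lists. */
static int arc_invariants_hold(const CDB cdbs[], int ncdb)
{
    int len[4] = {0, 0, 0, 0};
    for (int i = 0; i < ncdb; i++)
        len[cdbs[i].list]++;
    return len[ARC_T1] + len[ARC_T2] <= ARC_C &&
           len[ARC_T1] + len[ARC_T2] + len[ARC_B1] + len[ARC_B2] == ncdb;
}

/* Build one valid example directory: C entries split over T1/T2 (with
 * buffers), the other C split over the ghost lists B1/B2. */
static int example_directory_ok(void)
{
    CDB cdbs[ARC_NCDB];
    for (int i = 0; i < ARC_NCDB; i++) {
        cdbs[i].list = (i < ARC_C) ? ((i % 2) ? ARC_T1 : ARC_T2)
                                   : ((i % 2) ? ARC_B1 : ARC_B2);
        cdbs[i].has_buffer = (i < ARC_C);
    }
    return arc_invariants_hold(cdbs, ARC_NCDB);
}
```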
Assuming we have a full cache, one of 5 cases happens on a lookup:

MISS	On a cache miss, depending on T1target and the actual T1len,
	the LRU buffer of T1 or T2 is evicted. Its CDB is removed
	from the T list and added as MRU of the corresponding B list.
	The now free buffer is replaced with the requested page
	and added as MRU of T1.

T1 hit	The T1 CDB is moved to the MRU position of the T2 list.

T2 hit	The T2 CDB is moved to the MRU position of the T2 list.

B1 hit	This means that a buffer that was evicted from the T1
	list is now requested again, indicating that T1target is
	too small (otherwise it would still be in T1 and thus in
	memory). The strategy raises T1target, evicts a buffer
	depending on T1target and T1len and places the CDB at the
	MRU of T2.

B2 hit	This means the opposite of B1: the T2 list is probably too
	small. So the strategy lowers T1target, evicts a buffer
	and places the CDB at the MRU of T2.

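The five cases reduce to a small decision: which list the CDB ends up on (always at the MRU), and which way the T1target balance moves. A hedged sketch, with invented names (the real freelist.c does this inline, not through such a function):

```c
/* Where a lookup found the page's CDB, if anywhere. */
enum arc_hit { HIT_MISS, HIT_T1, HIT_T2, HIT_B1, HIT_B2 };

/* Outcome of one lookup, per the five cases in the README.
 * Illustrative structure, not the freelist.c API. */
typedef struct {
    int target_list;      /* list the CDB is put on as MRU: 1 = T1, 2 = T2 */
    int t1target_delta;   /* +1 grow T1target, -1 shrink it, 0 unchanged */
} arc_outcome;

static arc_outcome arc_on_lookup(enum arc_hit where)
{
    arc_outcome out = { 2, 0 };          /* default: MRU of T2, no change */
    switch (where)
    {
        case HIT_MISS:   /* evict from T1 or T2; newcomer goes to MRU of T1 */
            out.target_list = 1;
            break;
        case HIT_T1:     /* promote: T1 CDB to MRU of T2 */
        case HIT_T2:     /* refresh: T2 CDB back to MRU of T2 */
            break;
        case HIT_B1:     /* T1 was too small: raise T1target, CDB to MRU of T2 */
            out.t1target_delta = +1;
            break;
        case HIT_B2:     /* T2 was too small: lower T1target, CDB to MRU of T2 */
            out.t1target_delta = -1;
            break;
    }
    return out;
}
```

Note that only the ghost hits (B1/B2) adapt T1target; real hits and misses leave the balance alone.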
Thus, every page that is found on lookup in any of the four lists
ends up as the MRU of the T2 list. The T2 list therefore is the
"frequency" cache, holding frequently requested pages.

Every page that is seen for the first time ends up as the MRU of
the T1 list. The T1 list is the "recency" cache, holding recent
newcomers.

The tailoring done for PostgreSQL has to do with the way the
query executor works. A typical UPDATE or DELETE first scans the
relation, searching for the tuples, and then calls heap_update() or
heap_delete(). This causes at least 2 lookups for the block in the
same statement; in the case of multiple matches in one block, even
more. As a result, every block touched in an UPDATE or DELETE
would directly jump into the T2 cache, which is wrong. To prevent
this, the strategy remembers which transaction added a buffer to the
T1 list and will not promote it from there into the T2 cache during
the same transaction.

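The promotion guard amounts to one comparison on a T1 hit. A minimal sketch, assuming the CDB simply stores the inserting transaction's id (the field and function names here are invented for illustration):

```c
typedef unsigned int TransactionId;

/* Sketch of the PostgreSQL tailoring: a CDB on the T1 list remembers
 * which transaction inserted it. Illustrative names only. */
typedef struct {
    TransactionId t1_xid;    /* xact that put this page on T1 */
} T1Entry;

/* A T1 hit promotes the page to T2 only when the hit comes from a
 * different transaction than the one that inserted it; repeated
 * lookups within one UPDATE/DELETE statement stay on T1. */
static int should_promote_to_t2(const T1Entry *e, TransactionId cur_xid)
{
    return e->t1_xid != cur_xid;
}
```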
Another specialty is the change of the strategy during VACUUM.
Lookups during VACUUM do not represent application needs, so it
would be wrong to change the cache balance T1target due to them
or to cause massive cache evictions. Therefore, a page read in to
satisfy vacuum (not one that actually causes a hit on any list)
is placed at the LRU position of the T1 list, for immediate
reuse. Since VACUUM usually requests many pages very fast, the
natural side effect of this is that it will get back the very
buffers it filled and possibly modified on the next call and will
therefore do its work in a few shared memory buffers, while using
whatever it finds in the cache already.
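The effect of the vacuum hint on insertion position can be shown with a toy T1 list (an array where index 0 is the LRU end and the highest index the MRU end; the real list is a linked structure, and these names are invented for illustration):

```c
#define T1_CAP 8

/* Toy T1 list: t1[0] is the LRU end, t1[t1len-1] the MRU end. */
static int t1[T1_CAP];
static int t1len = 0;
static int vacuum_active = 0;

/* Corresponds to StrategyHintVacuum(bool vacuum_active). */
static void strategy_hint_vacuum(int active) { vacuum_active = active; }

/* Insert a newly read page into T1.  Normally newcomers become the
 * MRU; during vacuum they go to the LRU position instead, so vacuum
 * keeps recycling the same few buffers. */
static void t1_insert(int page)
{
    if (vacuum_active) {
        for (int i = t1len; i > 0; i--)   /* shift to free the LRU slot */
            t1[i] = t1[i - 1];
        t1[0] = page;                     /* LRU position: next to evict */
    } else {
        t1[t1len] = page;                 /* normal path: MRU end */
    }
    t1len++;
}
```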
